Part 1: Understanding the Competition#

A point of central importance is that this competition is not about mimicking or capturing what’s going on with, say, Grammarly. The rubric is very open-ended and includes nothing whatever about 1) punctuation, 2) spelling, 3) grammatical formality/correctness, 4) style, or 5) diction. From the description:

There are numerous automated writing feedback tools currently available, but they all have limitations, especially with argumentative writing. Existing tools often fail to evaluate the quality of argumentative elements, such as organization, evidence, and idea development.

This makes sense to me: grammarly, spellcheck, and so on can already assist with all of these, and the probability of a student having mastered these skills is something we can (sadly) already predict through data on race, income, zipcode, home-value, and so on. Also, as a former teacher, I can attest to the severity of the headache induced by grading thousands of documents a year. A model that could learn a rubric– whatever it consisted of– and then grade accordingly would be immensely valuable to our educators (and the immediacy of feedback to our students). Indeed, this seems to be the ambition of the project, as it states on the competition home page:

An automated feedback tool is one way to make it easier for teachers to grade writing tasks assigned to their students that will also improve their writing skills.

It goes on to specify that in addition to the things accompolished by existing writing feedback tools, they hope to develop a tool with more complexity. Indeed, the rubric (linked below) details almost philosophical criteria like ‘validity,’ ‘effectiveness,’ direction of attention, ‘stance-taking,’ ‘clarity,’ ‘relevance,’ ‘acceptability,’ ‘objectivity’, ‘soundness,’ ‘substantiation,’ and restatement. I personally think that the task of getting AI to recognize logic and valid reasoning is an extremely useful, lofty, and interesting goal. In an information-sphere awash with deep-fakes, conspiracy theories, false-leads, counterintelligence, propaganda, and general misinformation, an algorithm that could discern between sound and unsound reasoning could be of real use, whether in education or industry.

Rubric

The dataset presented here contains argumentative essays written by U.S students in grades 6-12. These essays were annotated by expert raters for discourse elements commonly found in argumentative writing:

  • Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis

  • Position - an opinion or conclusion on the main question

  • Claim - a claim that supports the position

  • Counterclaim - a claim that refutes another claim or gives an opposing reason to the position

  • Rebuttal - a claim that refutes a counterclaim

  • Evidence - ideas or examples that support claims, counterclaims, or rebuttals.

  • Concluding Statement - a concluding statement that restates the claims

In plain-language, in other words, does the argument make any sense? Imagine that the writer were instead speaking his or her argument (allowing for us to largely ignore grammar, spelling, interjections, and the like. Does the speaker stay on topic? Does the speaker state a position clearly? Does the speaker drive-home the point? Does the speaker show that A follows from B, and that factual evidence demonstrates B is true? Does the speaker consider opposing points of view, and if so, are they treated as men of straw or of steel? Assume the writer has an editor, or is using grammarly, or is just spit-balling a first-draft which will later be revised into a presentable document. Does the core structure and point come across and have any merit?

By taking a look at the rubric above, we can get a sense of exactly what the graders are looking for. The rubric consists of the types of discourse-elements, what they are characterized by, what makes them effective or ineffective, and an illustrative example-element:

Argumentation Element: Lead

Rating = Effective:
Discourse Prompt: “Should we admire heroes but not celebrities?
Description: The lead grabs the reader’s attention and strongly points toward the position.
Example: ‘ Too often in today’s society people appreciate only the circumstances which immediately profit themselves and they follow the pack, never stopping to wonder, “Is this really the person who deserves my attention?” ‘

Rating = Adequate:
Prompt: “Can people ever be truly original?”
Description: The lead attempts to grab the reader’s attention and points toward the position.
Example: ‘Originality: being able to do something that few or none have done before.’

Rating = Ineffective
Prompt: “Can people ever be truly original?”
Description: The lead may not grab the readers’ attention and may not point to the position.
Example: ‘Originality is hard to in this time and era.’

Sometimes, the examples given are quite long. For instance, when describing an effective ‘Evidence’ element, they offer the following:

Rating = Effective

Description: The evidence is closely relevant to the claim they support and back up the claim objectively with concrete facts, examples, research, statistics, or studies. The reasons in the evidence support the claim and are sound and well substantiated.

(Rather than a prompt, the ‘context’ is a previous claim made by the student): Claim: “There are a number of instances, in either our everyday lives or special occurrences, in which one must confront an issue they do not want to face.”

Example: “For instance, the presidential debate is currently going on and in order to choose the right candidate, they must research all sides of both candidates. The voter must learn all about the morals and how each one plans to better America. This might disturb some people, given that some people may either feel too strongly about a certain candidate or that they may not feel strongly enough. However, by not researching and gaining all the possible knowledge that they can, they are hurting themselves by passing up a valuable opportunity to possibly better the country for themselves and the people surrounding them.”

Ostensibly, then, the task is to 1) to hunt through a collection of examples of this type; 2) to assess them in light of these descriptions of efficacy, 3) and then predict algorithmically whether or not the text either a) contains content consistent with these descriptions or (more sublty) b) was assessed by a human rater to have satisifed the description. (More later.)

Some immediate associations I have are:

  • Integrative complexity: a term used in psychology to describe something like ‘tolerance for cognitive dissonance; ability to see two or more sides or positions; willingness to validate a legitimate point, even when it contradicts a previously held belief; serious consideration of evidence, as opposed to defensive dismissal or the issuing of rhetorical red-herrings.’

  • Sophistry: the use of language in an ostensibly coherent and persuasive way, but which is actually misleading or illogical; robust rhetorically, but substantially specious.

  • Default Bias: a concept from behavioral economics describing a tendency to fall back on a simple heuristic of the kind, ‘Unless (insert major factor) is present, I’ll just do (insert some default behavior).’ In this case, I suspect that teachers/graders are likely tired, bored, hurried, and otherwise ready to be done with this process, and thus will have some degree of a “Unless this essay really captures my attention (in a good or bad way), it’s ‘Adequate’.”

  • English Teacher Bias: a term I just made up, but which I propose to mean the instinctive negative reaction experienced by bookish, scholastic types when they see misspellings, grammtically inchoate sentences, poor punctuation, and so on. The rules don’t specify, I know, that these things are supposed to factor into the grading, but that doesn’t mean it doesn’t.

  • General Human Bias: Researchers in psychology and microeconomics have found a startling array of ‘predictably irrational’ behavior (tip o’ the hat to Dan Ariely) which humans are subject to, and often they are only tenuously linked to the topic of study. For example, judges in criminal courts have been found to sentence more harshly on the basis of how tired they are, or how long it’s been since they last ate. In general, I suspect that some of that same inconsistency will invariably be reflected in a dataset of this type.

1: Understanding the Dataset#

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import os

KFP_df = pd.read_csv("train.csv")

KFP_df.head(5)
discourse_id essay_id discourse_text discourse_type discourse_effectiveness
0 0013cc385424 007ACE74B050 Hi, i'm Isaac, i'm going to be writing about h... Lead Adequate
1 9704a709b505 007ACE74B050 On my perspective, I think that the face is a ... Position Adequate
2 c22adee811b6 007ACE74B050 I think that the face is a natural landform be... Claim Adequate
3 a10d361e54e4 007ACE74B050 If life was on Mars, we would know by now. The... Evidence Adequate
4 db3e453ec4e2 007ACE74B050 People thought that the face was formed by ali... Counterclaim Adequate
KFP_df.shape
(36765, 5)
type(KFP_df)
pandas.core.frame.DataFrame
KFP_df.isnull().sum()
discourse_id               0
essay_id                   0
discourse_text             0
discourse_type             0
discourse_effectiveness    0
dtype: int64
KFP_df.describe()
discourse_id essay_id discourse_text discourse_type discourse_effectiveness
count 36765 36765 36765 36765 36765
unique 36765 4191 36691 7 3
top 0013cc385424 91B1F82B2CF1 Summer projects should be student-designed Evidence Adequate
freq 1 23 14 12105 20977
KFP_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36765 entries, 0 to 36764
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   discourse_id             36765 non-null  object
 1   essay_id                 36765 non-null  object
 2   discourse_text           36765 non-null  object
 3   discourse_type           36765 non-null  object
 4   discourse_effectiveness  36765 non-null  object
dtypes: object(5)
memory usage: 1.4+ MB
KFP_df.discourse_effectiveness.value_counts()
Adequate       20977
Effective       9326
Ineffective     6462
Name: discourse_effectiveness, dtype: int64
testdict = dict(KFP_df['discourse_effectiveness'].value_counts())
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
myList = testdict.items()
myList = sorted(myList)
x,y = zip(*myList)
plt.bar(x, y)
plt.gcf().set_size_inches(6,4)
plt.show();
_images/Body_10_0.png
testdict = dict(KFP_df['discourse_effectiveness'].value_counts())

discSect = KFP_df['discourse_type']
discEff = KFP_df['discourse_effectiveness']
newb = pd.concat([discSect, discEff], axis = 1)
newbCT = pd.crosstab(discSect, discEff, margins = True, margins_name="total")
newbCT
discourse_effectiveness Adequate Effective Ineffective total
discourse_type
Claim 7097 3405 1475 11977
Concluding Statement 1945 825 581 3351
Counterclaim 1150 418 205 1773
Evidence 6064 2885 3156 12105
Lead 1244 683 364 2291
Position 2784 770 470 4024
Rebuttal 693 340 211 1244
total 20977 9326 6462 36765
newbCT.plot.bar()
plt.gcf().set_size_inches(8,5)
plt.show();
_images/Body_12_0.png
newbCT2 = pd.crosstab(discSect, discEff, normalize = 'index')
newbCT2
discourse_effectiveness Adequate Effective Ineffective
discourse_type
Claim 0.592552 0.284295 0.123153
Concluding Statement 0.580424 0.246195 0.173381
Counterclaim 0.648618 0.235759 0.115623
Evidence 0.500950 0.238331 0.260719
Lead 0.542994 0.298123 0.158883
Position 0.691849 0.191352 0.116799
Rebuttal 0.557074 0.273312 0.169614
plot2 = newbCT2.plot.barh(stacked = True, color = ['gold', 'forestgreen','firebrick'])
plot2.set(xlabel= 'effectiveness_proportion', ylabel = 'discourse_type')
plot2.legend(bbox_to_anchor = (1.05, .6));
_images/Body_14_0.png

Playing with Plotly (and TextStat)#

These two libraries make it almost embarassingly easy to generate interesting textual statistical summaries, as well as beautiful, interactive graphics. (Tip o’ the hat to Mr. Deepak Kaura for bringing this to my attention. Link.)

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

Here, I’m applying a bit of cleaning to the text. These methods will be explored in detail later on, but I’m temporarily doing so here to prevent some of the visuals being overly influenced by meaningless fluff and messy data. I’ll avoid modifying the original discourse_text, as some of the visuals will be more informative in their given state.

KFP_df['TextForVisuals'] = KFP_df['discourse_text'].astype(str).str.lower()
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation
KFP_df['TextForVisuals'] = KFP_df['TextForVisuals'].apply(strip_punctuation)
KFP_df['TextForVisuals'] = KFP_df['TextForVisuals'].apply(remove_stopwords)
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
KFP_df['TextForVisuals'] = KFP_df['TextForVisuals'].apply(lambda x: tokenizer.tokenize(x))
KFP_df['TextForVisuals'] = KFP_df['TextForVisuals'].apply(lambda x: " ".join([i for i in x if len(i) > 1]))
KFP_df['TFV_length'] = KFP_df['TextForVisuals'].apply(lambda x: len([i for i in x]))
KFP_df['TFV_length'].describe()
count    36765.000000
mean       134.036257
std        148.974666
min          0.000000
25%         45.000000
50%         81.000000
75%        167.000000
max       2267.000000
Name: TFV_length, dtype: float64

Relatedly, we can see that there are some massive outliers on the high-end (texts up to ~20x the mean length). These will make the graphs look all wonky, so I’m going to push the edges in towards a more normal distribution.

KFP_viz = KFP_df.copy(deep = True)
meanLength = KFP_df.TFV_length.mean()
stdLength = KFP_df.TFV_length.std()
maxLength = meanLength*3*stdLength
KFP_viz_norm = KFP_viz.loc[KFP_viz['TFV_length'] < maxLength]
import textstat

KFP_viz_norm['Reading_Time'] = 0
KFP_viz['Average_Char_per_Word'] = 0
KFP_viz_norm['Reading_Ease'] = 0
KFP_df['Average_Sen_Length'] = 0
KFP_viz['Average_Syllables_per_Word'] = 0
KFP_viz_norm['Word_Count'] = 0
for i in range(len(KFP_viz)):
    #With cleaning
    KFP_viz['Average_Char_per_Word'][i] = textstat.avg_character_per_word(KFP_viz['TextForVisuals'][i])
    KFP_viz['Average_Syllables_per_Word'][i] = textstat.avg_syllables_per_word(KFP_viz['TextForVisuals'][i])
        

    #With cleaning + outlier removal
    KFP_viz_norm['Reading_Ease'][i] = textstat.flesch_reading_ease(KFP_viz_norm['TextForVisuals'][i])
    KFP_viz_norm['Word_Count'][i] = textstat.lexicon_count(KFP_viz_norm['TextForVisuals'][i])    
    KFP_df['Average_Sen_Length'][i] = textstat.avg_sentence_length(KFP_df['discourse_text'][i])
    KFP_viz_norm['Reading_Time'][i] = textstat.reading_time(KFP_viz_norm['discourse_text'][i])

Note that the ‘Average Sentence Length’ uses the original text in its original form, as the sentence length would not be computable if we’d stripped punctuation; I am not using the processed text for ‘Reading Time’, but I am using the trimmed dataset; I am using processed and trimmed sets for the other values. I think that reading time and sentence length values will be more informative if we include all of the information about punctuation and stop-word usage.

import plotly.express as px
fig=px.histogram(data_frame=KFP_viz_norm,
                 x=KFP_viz_norm.Reading_Time,
                 marginal="violin",
                 color=KFP_viz_norm.discourse_type)

#Reading time is a statistic from textstat that estimates how long reading the text would take, assuming ~15ms per character.
fig.update_layout(title="Reading Time Variation with Respect to All Discourse Types:",
                  titlefont={'size': 25},
                  template='plotly_white'     
                  )
fig.show()

So, here we can, for example, see that evidence tends to take the most time to read (on average), while there are quite a lot of claims in comparison to, say, positions.

import plotly.express as px
fig=px.histogram(data_frame=KFP_viz_norm,
                 x=KFP_viz_norm.Reading_Ease,
                 marginal="violin",
                 color=KFP_viz_norm.discourse_type)

#See link below-- a metric of textual difficulty (negative values indicate extremely hard to make sense of)
fig.update_layout(title="Text Difficulty Variation with Respect to All Discourse Types:",
                  titlefont={'size': 25},
                  template='plotly_white'     
                  )
fig.show()

Here, the outliers are in the negative direction (at least of what’s left). The link below describes both the calculation this uses as well as the interpretation of the various scores. Roughly, the higher the score, the easier the text is to read, with a score of 90 being roughly commensurate with a 5th-grade reading level. My suspicion (given that the Harvard Law Review scores in the 30s on this scale) is that those descending from the peak of the curve are increasingly using bad punctuation, misspelled words, run-on sentences, and general incoherence, making them pretty inscrutable. Read more.

fig=px.histogram(data_frame=KFP_viz,
                 x=KFP_viz.Average_Char_per_Word,
                 marginal="violin",
                 color=KFP_viz.discourse_type)

fig.update_layout(title="Character Count Variation with Respect to Discourse Types:",
                  titlefont={'size': 25},template='plotly_white'     
                  )
fig.show()
temp = KFP_viz.groupby('discourse_type').count()['discourse_id'].reset_index().sort_values(by='discourse_id',ascending=False)
fig = go.Figure(go.Funnelarea(
    text =temp.discourse_type,
    values = temp.discourse_id,
    title = {"position": "top center", "text": "Funnel-Chart of discourse_type Distribution"}
    
    ))

fig.update_layout(font = {'size' : 20})
fig.show()
fig = px.bar(x=np.unique(KFP_viz["discourse_type"]), 
             y=[list(KFP_viz["discourse_type"]).count(i) for i in np.unique(KFP_viz["discourse_type"])], 
             color=np.unique(KFP_viz["discourse_type"]), 
             color_continuous_scale="Mint")
fig.update_xaxes(title="Classes")
fig.update_yaxes(title="Number of Rows")
fig.update_layout(showlegend=True, 
                  title={
                      'text':'Discourse Type Distribution', 
                      'y':0.95, 
                      'x':0.5, 
                      'xanchor':'center', 
                      'yanchor':'top'}, template="seaborn")
fig.show()
fig=px.histogram(data_frame=KFP_df,
                 x= KFP_df['Average_Sen_Length'],
                 marginal="violin",
                 color=KFP_df.discourse_type)

fig.update_layout(title="Average Sentence Length for All Discourse Types:",
                  titlefont={'size': 25},template='plotly_white'     
                  )
fig.show()

So, here, despite keeping punctuation and all intact, there are apparently a few texts that have virtually no periods, exclamations, or question marks, and just run on for 200-600 characters, distorting the dataset.

Trimmed_ASL = KFP_df.loc[(KFP_df['Average_Sen_Length'] <200) & (KFP_df['discourse_type'])]
fig=px.histogram(data_frame=Trimmed_ASL,
                 x= Trimmed_ASL['Average_Sen_Length'],
                 marginal="violin",
                 color=Trimmed_ASL.discourse_type)

fig.update_layout(title="Average Sentence Length for All Discourse Types (Trimmed):",
                  titlefont={'size': 25},template='plotly_white'     
                  )
fig.show()
fig=px.histogram(data_frame=KFP_viz,
                 x=KFP_viz.Average_Syllables_per_Word,
                 marginal="violin",
                 color=KFP_viz.discourse_type)

fig.update_layout(title="Average Number of Syllables per Word with Respect to Discourse Type:",
                  titlefont={'size': 25},template='plotly_white'     
                  )
fig.show()
fig=px.histogram(data_frame=KFP_viz_norm,
                 x=KFP_viz_norm.Word_Count,
                 marginal="violin",
                 color=KFP_viz_norm.discourse_type)

fig.update_layout(title="Word Count Distribution By Discourse Type:",
                  titlefont={'size': 25},template='plotly_white'     
                  )
fig.show()

So, for the most part, our dataset appears to be complete, properly formatted, and ready for analysis. We aren’t given much– the text, and what type it belongs to. It is possible to reconstruct the entire students’ essay (in some cases), and I believe that some of the winning models included those kinds of analyses in their models. However, I couldn’t think of any helpful way to do so, so we won’t look at those until the end.

Part 3: Introductory Methods in Natural Language Processing#

Resolving Encoding

Since the data is in such good shape, there’s not a whole lot that we have to do in terms of fixing missing values, imputation, reformatting, and so on. A few items, however, will need to be addressed.

Firstly, prior to this competition, I did not realize how many different styling indices existed for rendering text: utf, ACSII, unicode, and so on. While this didn’t prove to be a huge problem, there were some oddball texts I discovered, and the following functions resolve all of that to ensure a smooth execution later on. (Tip o’ the hat to Mr. DSML on Kaggle for providing the template: Link.)

from typing import Dict, Tuple, List
import codecs
from text_unidecode import unidecode
def replace_encoding_with_utf8(error: UnicodeError) -> Tuple[bytes, int]: return error.object[error.start : error.end].encode("utf-8"), error.end
def replace_decoding_with_cp1252(error: UnicodeError) -> Tuple[str, int]: return error.object[error.start : error.end].decode("cp1252"), error.end
codecs.register_error("replace_encoding_with_utf8", replace_encoding_with_utf8)
codecs.register_error("replace_decoding_with_cp1252", replace_decoding_with_cp1252)
def resolve_encodings_and_normalize(text: str) -> str:
    text = (text.encode("raw_unicode_escape").decode("utf-8", errors = "replace_decoding_with_cp1252").encode("cp1252", errors = "replace_encoding_with_utf8").decode("utf-8", errors = "replace_decoding_with_cp1252"))
    text = unidecode(text)
    return text
    
KFP_df.discourse_text = KFP_df.discourse_text.apply(lambda x: resolve_encodings_and_normalize(x))

Before we do much else, I want to go ahead and make a copy of the text in its original state, so that I can have it available for the feature engineering and other analyses prior to model-building, but after cleaning and processing the text.

KFP_df['discourse_copy'] = KFP_df.discourse_text

In the process of completing this project, I encountered a great-many novel terms and concepts, both from the fields of linguistics and cognitive science, as well as from computer science, programming, and data analytics. Likewise, tricks-of-the-trade that would likely have never occurred to me to attempt were abundant, and I’ll review and demonstrate some of those here.

Firstly, let’s import some of the more popular NLP-specific modules:

  • NLTK: Natural Language Toolkit

  • Gensim

  • TextBlob & Spacy (I don’t use them here, but I could have – their functionality and implementation are very similar, so check it out if you’re interested)

And I assume anyone reading this is already familiar with pickle, regex, and string.

import nltk
import string
import pickle as pkl
import gensim
import regex as re

One of the first thoughts I had when I started working on this was about how to deal with contractions. If you’re not careful, for one thing, when you .strip and .split and so on in Pandas, you can inadvertently create some oddball text-items. But furthermore, I don’t see out-of-hand why contractions couldn’t have some statistical implications. One module that I found I and liked is called ‘contractions’, and can be used in the following way:

import contractions
Demo = "Don't or Shouldn't or Can't or Won't?"
print(f" Original: {Demo}")
print('\n',f'Original (Split): {Demo.split()}')
Demo2 = strip_punctuation(Demo)
print('\n',f"Gensim Version: {Demo2}")
print('\n', f"Gensim (Split_PrePunctuation): {Demo}")
print('\n',f"Gensim (Split_PostPunctuation): {Demo2.split()}")
Demo3 = contractions.fix(Demo)
print('\n', f"Contractions Module: {Demo3}")
print('\n', f"Contractions Module (Split): {Demo3.split()}")
 Original: Don't or Shouldn't or Can't or Won't?

 Original (Split): ["Don't", 'or', "Shouldn't", 'or', "Can't", 'or', "Won't?"]

 Gensim Version: Don t or Shouldn t or Can t or Won t 

 Gensim (Split_PrePunctuation): Don't or Shouldn't or Can't or Won't?

 Gensim (Split_PostPunctuation): ['Don', 't', 'or', 'Shouldn', 't', 'or', 'Can', 't', 'or', 'Won', 't']

 Contractions Module: Do not or Should not or Cannot or Will not?

 Contractions Module (Split): ['Do', 'not', 'or', 'Should', 'not', 'or', 'Cannot', 'or', 'Will', 'not?']

In other words, using the contractions module allows you to completely delete punctuation, while not deleting any words, not retaining misspelled words, and not splitting the words along the apostrophe. I doubt it makes too much difference at the final analysis stage, but I like to have as much control as possible over what I keep and what I delete.

The next step is to strip out all of the punctuation marks, oddball characters, and capital letters.

import contractions
contractions.add("i'm", 'i am')
KFP_df.discourse_text = KFP_df.discourse_text.apply(contractions.fix)

from gensim.parsing.preprocessing import strip_punctuation
KFP_df.discourse_text = KFP_df.discourse_text.apply(strip_punctuation)

KFP_df.discourse_text = KFP_df.discourse_text.apply(lambda x: x.lower())
KFP_df.discourse_text = KFP_df.discourse_text.apply(lambda x: x.strip())

from gensim.parsing.preprocessing import strip_multiple_whitespaces
KFP_df.discourse_text = KFP_df.discourse_text.apply(strip_multiple_whitespaces)
#Output
KFP_df.discourse_text[1]
'on my perspective i think that the face is a natural landform because i do not think that there is any life on mars in these next few paragraphs i will be talking about how i think that is is a natural landform'

Now, since any text model that we will ultimately use for classification will a vectorized version of any features we pass, it is necessary to remove what are known as stopwords: words that are ubiquitous, words that add little-to-no value or potential for distinction, and so on. These words will create a lot of noise and chaos in the final vector space, ultimately pointing our learner off-target.

from gensim.parsing.preprocessing import remove_stopwords
print(gensim.parsing.preprocessing.STOPWORDS)
frozenset({'us', 'side', 'un', 'everything', 'could', 'behind', 'any', 'three', 'here', 'whereupon', 'herein', 'thereafter', 'on', 'moreover', 'always', 'do', 'nine', 'thereby', 'move', 'until', 'well', 'ie', 'further', 'among', 'and', 'has', 'cry', 'is', 'cant', 'onto', 'i', 'also', 'it', 'or', 'from', 'still', 're', 'cannot', 'one', 'wherein', 'over', 'are', 'while', 'without', 'anywhere', 'anyway', 'rather', 'became', 'hereupon', 'bill', 'co', 'whence', 'our', 'am', 'unless', 'please', 'whereas', 'couldnt', 'made', 'system', 'somewhere', 'he', 'mill', 'most', 'enough', 'thus', 'together', 'eight', 'anyhow', 'as', 'whole', 'each', 'both', 'whoever', 'an', 'make', 'therein', 'everywhere', 'same', 'indeed', 'sincere', 'etc', 'the', 'whatever', 'two', 'during', 'were', 'become', 'their', 'ourselves', 'must', 'wherever', 'so', 'under', 'fifty', 'per', 'never', 'nor', 'own', 'your', 'how', 'interest', 'twelve', 'yet', 'along', 'latterly', 'for', 'him', 'amount', 'whose', 'next', 'herself', 'a', 'becomes', 'will', 'whereafter', 'yours', 'all', 'its', 'above', 'thin', 'didn', 'others', 'anyone', 'somehow', 'if', 'seemed', 'computer', 'does', 'nobody', 'sometime', 'eg', 'nevertheless', 'down', 'whom', 'alone', 'otherwise', 'back', 'been', 'sixty', 'afterwards', 'serious', 'really', 'his', 'few', 'she', 'everyone', 'around', 'some', 'because', 'no', 'meanwhile', 'should', 'being', 'third', 'anything', 'in', 'thence', 'these', 'via', 'after', 'eleven', 'show', 'towards', 'using', 'whenever', 'put', 'than', 'bottom', 'see', 'empty', 'seeming', 'by', 'had', 'ever', 'thick', 'sometimes', 'below', 'can', 'something', 'many', 'those', 'inc', 'have', 'ours', 'don', 'various', 'fire', 'give', 'already', 'almost', 'that', 'perhaps', 'did', 'every', 'full', 'besides', 'however', 'why', 'forty', 'only', 'hence', 'doesn', 'this', 'name', 'yourselves', 'very', 'used', 'up', 'except', 'hereby', 'find', 'therefore', 'seem', 'beforehand', 'there', 'kg', 'fill', 'but', 'would', 'say', 'several', 'thru', 'beside', 'which', 'nowhere', 'who', 'ten', 'beyond', 'you', 'five', 'about', 'such', 'part', 'too', 'con', 'noone', 'at', 'call', 'quite', 'take', 'whether', 'mostly', 'of', 'out', 'them', 'between', 'into', 'twenty', 'more', 'fifteen', 'other', 'none', 'becoming', 'we', 'found', 'within', 'latter', 'often', 'may', 'toward', 'her', 'throughout', 'last', 'namely', 'hundred', 'hasnt', 'de', 'though', 'ltd', 'hereafter', 'whither', 'even', 'get', 'elsewhere', 'himself', 'although', 'just', 'someone', 'now', 'else', 'less', 'off', 'hers', 'my', 'done', 'another', 'former', 'seems', 'detail', 'be', 'describe', 'much', 'thereupon', 'where', 'then', 'across', 'was', 'me', 'km', 'upon', 'six', 'against', 'nothing', 'they', 'neither', 'amongst', 'go', 'what', 'front', 'top', 'amoungst', 'whereby', 'mine', 'again', 'four', 'keep', 'when', 'through', 'myself', 'itself', 'once', 'first', 'formerly', 'might', 'doing', 'least', 'not', 'regarding', 'due', 'yourself', 'to', 'with', 'before', 'either', 'themselves', 'since'})

Some of these words, as you can see, are pretty unsurprising: a lot of prepositions (‘of’, ‘in’), a lot of prefixes/affixes (‘un’, ‘re’), a lot of pronouns (‘she’, ‘him’), a lot of psuedo-words (‘ltd’, ‘co’, ‘ie’), and a lot of words that just appear in everything (‘are’, ‘did’, ‘just’, ‘well’, ‘but’, and so forth).

On the other hand, some of them are odd to me (‘latterly’?), Shakespearean (‘hereupon’, ‘whence’), numeric (‘fifty’, ‘eight’), or otherwise just…not what I expected (‘mill’, ‘computer’). In any event, I presume the authors of the list have spent enough time researching and curating this list to have made a pretty good case for their inclusion. You can read more about the development of the list (at least according to Gensim) in ‘Stone, Denis, Kwantes (2010).’

KFP_df['discourse_text'] = KFP_df['discourse_text'].apply(remove_stopwords)

The next step is to ‘tokenize’ the text. (Overly) Simply put, this means parsing the sentence into discrete lexemes (words) or graphemes (symbols, notation) so as to make the text amenable to later forms of processing or analysis. The RegexpTokenizer, courtesy of NLTK, uses regular expressions as the boundary at which it cuts the string. Some models, it should be mentioned, are optimized to work with a model-specific tokenizer, so select accordingly.

tokenizer = RegexpTokenizer(r'\w+')
KFP_df['discourse_tokens'] = KFP_df.discourse_text.apply(tokenizer.tokenize)
KFP_df['discourse_tokens'][22]
['change',
 'emotion',
 'students',
 'feeling',
 'help',
 'students',
 'education',
 'waste',
 'time']

Frequency Analysis is the process of analyzing documents for the frequency with which they contain some object of interest. In cryptography, this is used to decipher characters or words based on the probability of that word appearing in a given target text’s language, for example. Here, we’ll explore briefly the popularity of words in the text, and what – if any – implications they carry for this analysis.

from nltk.probability import FreqDist
#join tokens into strings                                                         #Note: only tokens greater than 2 char.
KFP_df['frequency_strings'] = KFP_df['discourse_tokens'].apply(lambda x: ' '.join([item for item in x if len(item) > 2]))
#join ALL strings into a single string
discourse_words = " ".join([word for word in KFP_df['frequency_strings']])
#tokenize this monstrosity
document_tokens = tokenizer.tokenize(discourse_words)
#Make a counter dict from this composite string
fdist = FreqDist(document_tokens)
fdist.most_common(10)
[('students', 12246),
 ('people', 10610),
 ('school', 7427),
 ('electoral', 7099),
 ('college', 6035),
 ('vote', 5963),
 ('like', 5845),
 ('think', 4927),
 ('time', 4769),
 ('help', 4554)]
top25 = fdist.most_common(25)
series_top25 = pd.Series(dict(top25))
fig = px.bar(y = series_top25.index,
            x = series_top25.values,)

fig.update_layout(barmode = 'stack',
                 yaxis = {'categoryorder':'total ascending'})

fig.show()

We can see clearly that most of the words^ are either a) topical, specific to the essay prompt (e.g., ‘states’, ‘votes), b) pretty-much-stop-words (‘like’), or c) so general and ubiquitous they’ll likely do more harm than good in signal-to-noise terms (‘know’, ‘better’). We can deal with this by adding some additional qualifications to our token filter. We can also eliminate some oddball words that appear only once in the entire dataset, words that appear constantly in the datset, and anything that’s not a word but a number.

(^If you want to bump-up your credibility in a crowd of non-NLP people, say ‘Unigrams,’ not ‘words.’ An ‘N-gram’ is the formal way of referring to a subsection of text that has length N: for example a ‘trigram’ is a subset of three words– excuse me, unigrams– like ‘who even cares’ or ‘highfalutin academic talk’ and so on.)

topical_unigrams = ['mona', 'lisa', 'electoral', 'college', 'landform', 'project', 'projects', 'venus', 'electors', 'phones',
                   'learning', 'votes', 'aliens', 'congress', 'constitutional', 'president', 'vote', 'classes', 'online',
                   'summer', 'election', 'students', 'student', 'school', 'extracurricular', 'life', 'mars', 'disctrict', 
                    'columbia', 'kerry', 'ran', 'national']

more_stopwords = ['like', 'the', 'i', 'want']

topical_unigrams+= more_stopwords

KFP_df.discourse_tokens = KFP_df.discourse_tokens.apply(lambda x: \
                                                        [token for token in x if not token.isnumeric() \
                                                         and fdist[token] >1 \
                                                         and fdist[token] < 25000 \
                                                         and token not in topical_unigrams]) 

Finally, in the still-further interest of clearing out noise and reducing dimensionality, a common practice is to either ‘Lemmatize’, ‘Stem’, or ‘Both’ (kidding) the tokens. Both of these processes involve some melange of simplifying/affix-removing/de-conjungating/etc. the words to an infinitive form, a ‘root’, or something like that. As you can likely tell, I’m not 100%-clear myself on what the rules are, but some illustrative examples can clarify. (Tip o’ the hat to the folks over at Turing.com for this helpful guide: Link.)

from nltk.stem import PorterStemmer
PS = PorterStemmer()
from nltk.stem.snowball import SnowballStemmer
SS = SnowballStemmer(language= 'english')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
illustrative_words = ['plays', 'playing', 'pharmacy', 'pharmacies', 'programmatic', 'badly']

print(f"Original: {illustrative_words}")
print('\n'f"Porter Stemmer-ed: {[PS.stem(i) for i in illustrative_words]}")
print('\n'f"Snowball Stemmer-ed: {[SS.stem(i) for i in illustrative_words]}")
print('\n'f"Word-Net-Lemmatizer-ated: {[lemmatizer.lemmatize(i) for i in illustrative_words]}")
Original: ['plays', 'playing', 'pharmacy', 'pharmacies', 'programmatic', 'badly']

Porter Stemmer-ed: ['play', 'play', 'pharmaci', 'pharmaci', 'programmat', 'badli']

Snowball Stemmer-ed: ['play', 'play', 'pharmaci', 'pharmaci', 'programmat', 'bad']
Word-Net-Lemmatizer-ated: ['play', 'playing', 'pharmacy', 'pharmacy', 'programmatic', 'badly']

Since the computer largely interprets – at least in some models – the word-vectors (more later) for ‘pharmacy’ and ‘pharmacies’ as equivalent, the stemmer reduces noise and redundancy by abbreviating the word to some kind of least-common-denominator such that whatever it is paying attention to is identical for both. As I mentioned above, though, the exact process needed will depend, in part, on the model you choose and what you’re trying to do. In this case, I’m going to stick with the lemmatized version since it seems to retain the most meaning, and since the stemmed versions print out so badli.

lemmatizer = WordNetLemmatizer()
for token_list in KFP_df.discourse_tokens:
    for token in token_list: 
        lemmatizer.lemmatize(token)
KFP_df.discourse_tokens.head(5)
0    [going, writing, face, natural, story, nasa, t...
1    [perspective, think, face, natural, think, par...
2                   [think, face, natural, descovered]
3    [know, reason, think, natural, live, order, cr...
4    [people, thought, face, formed, alieans, thought]
Name: discourse_tokens, dtype: object

A word cloud is an absolutely worthless graphic in terms of analytics, but the wordcloud module makes it super easy to do, and they look kind of cool, so what the heck?

%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

lemmat_strings = KFP_df.discourse_tokens.apply(lambda x: ' '.join([i for i in x]))
lemmatized_tokens = ' '.join([i for i in lemmat_strings])

wordcloud = WordCloud(width = 600,
                     height = 400,
                     random_state = 2,
                     max_font_size = 100).generate(lemmatized_tokens)

plt.figure(figsize= (10, 7))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off');
_images/Body_66_0.png

Part 4: Hypothesis-Generation And Feature Engineering#

With all of the text parsing and preprocessing out of the way, we have a prepared set of data that’s amenable to analysis. However, we don’t have all that much non-textual content to work with. The statistics given by the textstat module above may be useful, but I have some suspicions that I’d like to explore at least, to see if we can squeeze anything further out of this. First off, I want to rid my df of the irrelevant features we created above.

Punctuation Mark Analysis:#

While the rubric doesn’t state that punctuation is considered in grading, that doesn’t mean that it isn’t. As discussed, all human graders are biased, and most are lovers of language and literature (at least essay-graders). It’s hard to resist the tug of emotional-aesthetic revulsion when faced with horrid grammar. Likewise, the things the rubric does specify (complexity, valid argumentation, supportive evidence, clear logic, and so on) are, I’d wager, positively correlated with factors of intelligence, literacy, breadth-of-exposure to good writing, and attentiveness to school and detail. Thus the effective use of punctuation may not be why an excerpt scored highly, but it may still contribute to the recognition of such items.

  1. Text-style (lol ‘textile’) writing– as in ‘text messaging’– is frequently CRAZY!!!!! LIKE THIS!!!!!!!!!!! There’s a reasonable number of exclamation points to use in a text of x-length, and that number may well be zero. But it would be odd if every sentence ended with one! Tell me I’m wrong! See how strange this is?! Thus, I’d guess their excessive use is an indication of weak writing.

KFP_df['exclamation_count'] = KFP_df.discourse_copy.apply(lambda x: len([i for i in x if i == '!']))
  1. As Abe Lincoln never said, “The thing about quotation marks is they come in pairs.” It seems to me that an odd number of quotation marks is a bad sign… Furthermore, the rubric specifies that effective excerpts reference external material and grab attention with witticisms and sagacicities. If one is using quotation marks (in pairs, at least), that probably means they are, well, quoting someone. Since this is specifically identified as valuable, we’ll check for it.

KFP_df['quotation_marks'] = KFP_df.discourse_copy.apply(lambda x: len([i for i in x if i == '"']))
  1. In general, at least one or two of the following marks should be present in pretty much any effectively written sentence. That said, their presence doesn’t indicate much, but their absence does. Furthermore, these should increase in frequency more-or-less linearly with text-length. You simply can’t, and shouldn’t, avoid using apostrophes to annotate contractions, commas to separate clauses, and periods to terminate sentences. Even terse sentences require periods. A low-ratio of marks like these to words used means– almost certainly– run-on sentences, incoherent babble, and plain-ol’ bad writing that you dont won’t to be guilty of doin.

basic_punctuation_marks = ['?',
                           ',',
                           "'",
                          '.']

KFP_df['basic_punctuation'] = KFP_df.discourse_copy.apply(lambda x: \
                                                                     len([i for i in x if i in basic_punctuation_marks]))
  1. These punctuation marks are more subtle, and they tend to be underused. There are complicated rules dictating the use of colons: sometimes they are used for abrupt direction-changes, other times for implied conclusions, and yet other times for making lists. Parentheses are also tricky (and underused). (They’re especially finnicky when used in standalone sentences.) Semi-colons and hypens aren’t easy either; many writers mis-use them, or neglect them entirely. Likewise, the rubric suggests making reference to statistics in support of evidence, as well as in an attention-grabbing lead, but I’d reckon no more than 25% of students did.

effective_punctuation_marks = ['-',
                               '(',
                               ')',
                                '%',
                               ':',
                               ';'
                              ]
                               

KFP_df['positive_punctuation'] = KFP_df.discourse_copy.apply(lambda x: \
                                                                     len([i for i in x if i in effective_punctuation_marks]))
  1. In much the same way as above, I think it’s reasonable to think that the mere presence of a lot of punctuation is a good thing.

punctuation_marks = string.punctuation

def punctuation_diversity(text):
    score = 0
    marks = set(text)
    punctuation_markz = list(punctuation_marks)
    
    for mk in punctuation_markz: 
        if mk in marks:
            score+= 1 
    
    return score
KFP_df['punctuation_diversity_text'] = KFP_df['discourse_copy'].apply(lambda x: punctuation_diversity(x))

Lexical Analysis:#

I think a safe and simple premise, at least for exploratory analysis, is that some degree of probability that the essay/element has performed well or poorly can be revealed by the words used by the author. The following are some collections of words that I propose carry relevance to that end.

  1. Complex Unigrams:I mentioned above that the competition guidelines put me in mind of Integrative Complexity. To my mind, certain words and ideas suggest this more distinctively than others. For example, words like ‘whereas,’ ‘however,’ and ‘perspective’ suggest that perhaps the author is considering counterarguments or taking other views into account. Words like ‘therefore,’ ‘furthermore,’ and ‘prove,’ suggest an argument is being built. ‘Sources,’ ‘quote,’ and ‘research’, ‘percentage,’ and so on suggest reference is being made to credible, external material.

complex_unigrams= ['procrastinate', 'whereas', 'detrimental', 'grasp', 'pursue', 'additionally', 'irrelevant', 'critical',
          'motivation', 'thus', 'however', 'assignments', 'assignment', 'source', 'sources', 'interests', 'asking', 'given',
          'skills', 'education', 'educational', 'ability', 'schooling', 'beneficial', 'allowing', 'instance', 'example',
            'therefore', 'excel', 'concepts', 'perspective', 'contradictory', 'allows', 'solutions', 'direction','forgoing',
            'view', 'viewpoint', 'confirmatory', 'confirm', 'evidence', 'proof', 'disprove', 'prove', 'contradict',
            'challenge', 'challenging', 'contradict', 'mislead', 'demonstrate', 'misleading', 'compromise', 'resolve']
def complexity_eff(x):
    score = 0
    for item in x:
        if item in complex_unigrams:
            score +=1
    return score

KFP_df['count_E_tokens']= KFP_df['discourse_tokens'].apply(lambda x: complexity_eff(x))
  1. Tricky Spelling: A quick Google search – along with self-reflective composition – shows that not all words are created equally in terms of easiness-to-spell. Since they are so frequently gotten wrong, I doubt that a human reviewer would penalize their inaccuracy (assuming they even noticed). However, if they are spelled correctly that suggests the author was either a) quite lucky or b) sufficiently verbally agile to get the spelling correct. In the latter case, correct spellings may be a sign of superior literacy.

tricky_spells_correct = ['absence', 'address', 'access', 'believe', 'beginning', 'privilege', 'separate', 'license',
                        'necessary', 'height', 'foreign', 'essential', 'receive', 'receiving', 'focused', 'though',
                        'through', 'unique', 'experience', 'experiences', 'occur', 'success', 'field', 'views', 'achieve']
def spelling_eff(x):
    score = 0
    for item in x:
        if item in tricky_spells_correct:
            score +=1
    return score


KFP_df['effective_spelling']= KFP_df['discourse_tokens'].apply(lambda x: spelling_eff(x))
  1. Not-So-Tricky Spelling: In contrast to (some) of the words above, some words are either a) not difficult to spell, or b) find misspellings that defy the imagination. I-before-E-trips-up-even-me (on occasion), but win sumwun spels perticlurlee bad, like a Chik-Fil-A cow, it’s harder to excuse.

misspells = ['absense', 'adress', 'alot', 'beleive', 'cieling', 'calendur', 'begining', 'experiance', 'embarass', 'sience', 
            'seperate', 'wierd', 'truely', 'independant', 'goverment', 'hieght', 'foriegn', 'greatful', 'enviroment', 
             'privelege' 'libary', 'lisense', 'misterious', 'neccessary', 'peice', 'nieghbor', 'peolpe', 'electorals', 'stuf', 
             'alot', 'stuff', 'think', 'whats', 'pharaoh', 'activitys','ther', 'gonna', 'beacause', 
             'actaully', 'somone', 'selves', 'driveing', 'paragragh', 'moc', 'aswell']
def sloppy_spelling(x):
    score = 0
    for item in x.split():
        if item in misspells:
            score += 1
    return score

KFP_df['count_misspelled_tokens']= KFP_df['discourse_copy'].apply(lambda x: sloppy_spelling(x))
  1. Contraction Reaction: I dont think yall need something thats this obvious to be explained further, aint that right? Cant really help yourself…

Bad_Contractions = ['cant', 'wont', 'isnt', 'aint', 'dont', 'werent', 'wernt', 'doesnt', 'thats', 'arent', 'couldnt',
                   'wouldnt', 'didnt', 'hadnt', 'Im', 'im', 'shouldnt', 'shes', 'hes', 'lets', 'id', 'hed', 'havent', 'ill',
                   'ive', 'Ive', 'Ill', 'theres', 'theyd', 'theyre', 'theyll', 'weve', 'wed', 'youve', 'youre', 'youd',
                    'whos', 'whove', 'wheres', 'whats']

KFP_df['contraction_errors_text'] = KFP_df['discourse_copy'].apply(lambda x: \
                                                                   len([i for i in x.split() if i in Bad_Contractions]))
  1. Conservative Unigrams: This is a bit of a controversial addition, but I figured it was worth looking at. Given this is a university-sponsored competition, given many of the essays appeared to discuss political topics, and given increasing political division in America, I don’t think it’s unreasonable to suspect that politically conservative words might be associated with a strong reaction (negative or positive).

conservative_unigrams = ['conservative', 'bible', 'moral', 'christian', 'god', 'liberty', 'trump']

def conservatism(x):
    score = 0
    for item in x: 
        if item in conservative_unigrams:
            score+=1 
    return score

KFP_df['conservatism_score'] = KFP_df['discourse_tokens'].apply(lambda x: conservatism(x))
  1. Overused Unigrams: It seems to me that an abundant presence of vague, hackneyed, or filler-type words might be a sign of poor rhetorical skill, unclear thinking, a lack of something substantial to say, or all of the above.

basic_unigrams = ['anybody', 'guys', 'basically', 'conclusion', 'like', 'the', 'want', 'i', 'very', 'whatever']

def complexity_inef(x):
    score = 0
    for item in x:
        if item in basic_unigrams:
            score += 1
    return score

KFP_df['basic_unigrams']= KFP_df['discourse_tokens'].apply(lambda x: complexity_inef(x))
  1. Frequency Analysis: The frequency dictionary generated above didn’t show any distinctive difference between the vocabularies of ‘Adequate’ essays and the vocabularies of ‘Ineffective’ ones. However, the ‘Effective’ essays had several hundred words that didn’t appear in either of the other ratings. Some are no doubt idiosyncratic, topical, or otherwise not applicable to the test-set texts. However, the following list is some of which I suspect are more generalizable.

E_Additions = ['shown', 'systems', 'essential', 'effects', 'struggling', 'unable', 'bias', 'biased', 'notes',
              'pursue', 'sources', 'perspective', 'perspectives', 'previous', 'interaction', 'concept',
              'occur', 'success', 'solve', 'field', 'cases', 'available', 'general', 'quality', 'directly', 
              'strong', 'additionally', 'flaws', 'flaw', 'flawed', 'aspect', 'include', 'influence', 
               'relationships', 'considering', 'crucial', 'efficient', 'resources', 'resource', 'provide',
              'provides', 'furthermore', 'exploration', 'results', 'fully', 'ultimately', 'potentially']
def E_xclusive(x):
    score = 0
    for item in x:
        if item in E_Additions:
            score += 1
    return score

KFP_df['E_Exclusive_tokens']= KFP_df['discourse_tokens'].apply(lambda x: E_xclusive(x))
  1. Bigrams: As I started to think about bigram analysis, I quickly realized how out-of-hand this could get, so I’ve opted to only point to a handful that I think are particularly strong. (Plus this makes my computer run slowly to process, and that drives me bonkers.)

bigram_list = ['according to',
          'for example',
           'for instance',
           'may claim',
           'despite that',
           'fair point',
           'valid point',
           'valid objection',
           'research shows',
               'light of',
               'you might',
               'they think']

tuple_list = []
for i in bigram_list:
    b = tuple(i.split())
    tuple_list.append(b)
KFP_df['bigrams'] = KFP_df['discourse_copy'].apply(lambda x: list(nltk.bigrams(x.split())))
KFP_df['bigrams_score'] = KFP_df['bigrams'].apply(lambda x: len([i for i in x if i in tuple_list]))
KFP_df.drop(columns = ['bigrams'], axis =1, inplace=True)

Style Analysis#

Given that this project evaluates the effectiveness of argumentation, I thought there may be some stylistic features that are more-or-less unique to/characteristic of weak arguments.

  1. ALL CAPS!!!!! (AND EXCLAMATION POINTS!!!!!!!!!SEE ABOVE!!!!!!!) Rather than articulating anger and detailing frustration, all-caps writing suggests the written equivalent of a shouting match.

def caps(x):
    score = 0
    for i in x.split(): 
        if i.isupper():
            score += 1
            
    return score
KFP_df['caps_text'] = KFP_df['discourse_copy'].apply(lambda x: caps(x))
  1. Relatedly, the use of the ad-hominem – personal insults– rather than subtle analysis of flaws in reasoning is a classic fallacy that would, on presumes, be met with aversion by graders.

ad_hominemlst = ['idiot', 'stupid', 'dumb', 'loser', 'losers', 'idiots', 'morons', 'moron', 'liar', 'liars']

def ad_hominem(x):
    score = 0
    for item in x:
        if item in ad_hominemlst: 
            score+=1
    return score

KFP_df['ad_hominem'] = KFP_df['discourse_tokens'].apply(lambda x: ad_hominem(x))
  1. The appearance of a highly infrequent word seems to me to be a bad sign – at least in this context. There are positive situations in which an extremely uncommon word would be a good thing: a spelling bee, a highly technical journal or article, or in one of Christopher Hitchens’ essays. In this context, though, I suspect a highly infrequent word is one that is either a) misspelled, b) non-English, c) completly misused, or d) um…not a word (e.g., an onomatopoeia, a slang term, etc.). (Note, we have to use the original discourse_text variable, rather than the tokens, since we have already removed all tokens that appear less than once.)

KFP_df['weird_tokens']= KFP_df['discourse_copy'].apply(lambda x: \
                                                               len([item for item in x.split() if fdist[item] == 1]))
  1. I’m not sure that it’s exactly ‘incorrect’ to have a text replete with numeric tokens, but I find it distracting and a bit juvenile. When I see ‘there are 44 seats in the house’ instead of ‘there are forty-four seats in the house’, it doesn’t feel quite right to me. I’m not sure Strunk & White would concur, but I still think i’ts worth exploring.

def numerical(text):

    score = 0
    for word in text.split():
        if word.isdigit():
            score +=1
    return score

KFP_df['count_numerical_text'] = KFP_df['discourse_copy'].apply(lambda x: numerical(x))
  1. Sentiment Analysis is another common NLP task, and is blessedly simple to implement using NLTK’s API. Psychological research has shown a bias amongst the literati towards work that is negative in tone – not so much writing that is angry or hostile, but writing that is pessimistic, that covers or describes injustice, disaster, war, political discord, and so on. The idea is that such work is more ‘serious’ than other content, and I can see the same thing happening in politically charged high-school essays.

from nltk.sentiment import SentimentIntensityAnalyzer
disc_sent_analyzer = SentimentIntensityAnalyzer()
nltk.download('vader_lexicon')
KFP_df['polarity'] = KFP_df['discourse_text'].apply(lambda x: disc_sent_analyzer.polarity_scores(x))
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\goffm\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

Standardization#

For almost all of the features added above, the presence of the features and the count of the features may not be particulalry illuminating. What I mean is that given the considerable variance in discourse length any of the given figures as a count could be quite misleading. For example, if a text was a paragraph-long, the presence of some relatively uninformative words wouldn’t be a bad sign, and if it was a sentence-long, we wouldn’t expect much diversity of punctuation. On the other hand, if it was a paragraph-long and had no punctuation (or very little), that would be a big red flag. That in mind, I think the ratio of these features to either the total number of characters or the total number of tokens in the text is a much more informative metric.
Before we take those measurements, however, I need to make sure to avoid requesting division by zero…

def counting(x):
    alpha = len([i for i in x])
    if alpha == 0:
        alpha+=0.1
    
    return alpha

KFP_df['Characters_In_Text'] = KFP_df['discourse_copy'].apply(lambda x: counting(x))
KFP_df['Words_In_Text'] = KFP_df['discourse_text'].apply(lambda x: counting(x))
KFP_df['Tokens_In_Text'] = KFP_df['discourse_tokens'].apply(lambda x: counting(x))

Relatedly, Claude Shannon’s great insight was in realizing that natural language is dense with redundancy. Much of our speech can be condensend into vastly smaller spaces by avoiding repetition (“Avoid repetition! Avoid repetition!” - E.B. White), by not using lots of letters when a few will do the job (“Eschew superfluous verbiage” - Mark Twain), and by eliminating predictable components of speech and language. For example, psyclgsts hve shwn tht we cn rd frly cmplx sntcs evn whn mst vwls are absnt. Cryptographers, likewise, make short work of unsurprising passwords, codes, and on so on simply by counting the number of repeated characters in a masked or encrypted document. (Hint: in English, it’s probably ‘e’.) That in mind, having clipped all of the stop-words, dull and topical terms, white-space, and so forth, and whittled our way down to tokens that aren’t ubiquitous in the document-space, the ratio of what’s left to what we started with seems like it may be indicative of how much useful, original, and informative content was embedded in the text to begin with.

#How much is left: How much we started with:
KFP_df['informative_proportion'] = KFP_df.Tokens_In_Text/KFP_df['Characters_In_Text']

#How much of what's left that's *good* : How much we started with - punctuation, etc.:
KFP_df['proportion_E_tokens'] = KFP_df['count_E_tokens']/KFP_df['Words_In_Text'] 

#How much of what's left is not-so-good: How much we started with - punctuation, etc.:
KFP_df['proportion_I_tokens'] = KFP_df['basic_unigrams']/KFP_df['Words_In_Text']

Likewise, we really need comparable metrics for some of the other features, like exclamation marks and so on:

#Odd Number of quotation marks
KFP_df['Odd_Number_Quotes'] = KFP_df['quotation_marks'].apply(lambda x: 1 if x%2 != 0 else 0)

#Surfeit of exclamation points
KFP_df['exclamation_proportion'] = KFP_df.exclamation_count/KFP_df['Characters_In_Text']

#Summary of poor writing features:
bad_signs = ['count_misspelled_tokens', 
             'contraction_errors_text',
             'caps_text',
             'basic_unigrams', 
             'ad_hominem', 
             'weird_tokens',
            'count_numerical_text',
            'exclamation_count',
            'Odd_Number_Quotes']

KFP_df['bad_signs_score'] = KFP_df[[i for i in bad_signs]].sum(axis = 1)

Part 5: KDD/EDA/Mining/FeatureSelection/OtherSmartSoundingDataWordsAndAcronyms#

Let’s begin by taking a look at exactly what all it is that we have, then sorting it by data-type. We can then explore removing redundant or colinear features, collapsing similar items, binning or otherwise discretizing some of the features if needed, and comparing distributions by discourse effectiveness score.

#We don't need the following variables for analysis: 
to_drop = ['TFV_length', 'essay_id','TextForVisuals', 'frequency_strings']
KFP_df.drop(columns = [i for i in to_drop], axis = 1, inplace = True)

#We need the various 'count'-related items to all be int, not float. 
transform = ['Words_In_Text', 'Tokens_In_Text']
KFP_df[[i for i in transform]] = KFP_df[[i for i in transform]].astype('int64')

#Grouping features of common data types:
TextFeats = ['discourse_text', 'discourse_tokens', 'discourse_copy']
IntFeats = KFP_df.select_dtypes(include= ['int64'])

#The proportion features need to remain floats: 
FloatFeats = ['informative_proportion', 
              'proportion_E_tokens', 
              'proportion_I_tokens', 
              'exclamation_proportion', 
              'Average_Sen_Length']

#Discourse type will work best as a categorical feature
KFP_df.discourse_type = KFP_df.discourse_type.astype('category').cat.codes

#Other: the polarity_score, which comes to us in the form of a dictionary of values: 
otherFeats = ['polarity']
from yellowbrick.features import Rank2D

visualizer1 = Rank2D(algorithm = 'pearson', 
                     size = (996,696),
                    title = "Correlation Between Continuous IVs")

visualizer1.fit_transform(KFP_df[[i for i in IntFeats]])
visualizer1.show();
_images/Body_117_0.png

Unsurprisingly, the ‘tokens’, ‘words,’ and ‘characters’-in-text have a high level of correlation, and there’s considerable correlation with the punctuation_diversity feature as well. We can also see a high positive correlation between the number of n-grams that were numeric, and the ‘bad_signs’ score.

Redundant = ['Words_In_Text', 'Tokens_In_Text', 'Characters_In_Text']
#For these, we can collapse them into a single metric
KFP_df['Length'] = KFP_df[[i for i in Redundant]].sum(axis = 1)
KFP_df['Length'] = KFP_df['Length'].apply(lambda x: round(x/3), 2)
KFP_df.drop(columns = [i for i in Redundant], axis = 1, inplace = True)
#Since this in -- and otherwise strongly correlated with -- the 'bad_signs', we can get rid of it. 
KFP_df.drop(columns = ['count_numerical_text'], axis = 1, inplace = True)
#We can just drop this as well.
KFP_df.drop(columns = ['basic_punctuation'], axis = 1, inplace = True)
#The count of quotation marks was only collected to find texts with an odd number thereof
KFP_df.drop(columns = ['quotation_marks'], axis = 1, inplace = True)
from yellowbrick.features import Rank2D

visualizer1 = Rank2D(algorithm = 'pearson', 
                     size = (596,296),
                    title = "Correlation Between Continuous IVs")

visualizer1.fit_transform(KFP_df[[i for i in FloatFeats]])
visualizer1.show();
_images/Body_120_0.png

Nothing seems to be problematic amongst those variables.

KFP_df.shape
(36765, 29)

29 features is still an awful lot. Since we’ve collapsed the ‘bad_signs’ – each of which in isolation may not be a particularly strong predictor– I think it’s probably okay to drop those. We also still have the 3 variations on the text, even though only the tokenized and lemmatize version will likely be needed during model-building.

to_drop_2 = ['discourse_text', 'discourse_copy']
bad_signs.remove('count_numerical_text')
to_drop_2+= bad_signs
KFP_df.drop(columns = [i for i in to_drop_2], axis = 1, inplace = True)

Yellowbrick has another very cool feature-selection visual regressor that we can use to make an estimate as to the predictive potential of our features.

x = KFP_df[[i for i in FloatFeats]]
y = KFP_df.discourse_effectiveness.to_list()
from yellowbrick.model_selection import FeatureImportances
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(multi_class = 'auto', solver = 'liblinear')
visualizer3 = FeatureImportances(model, 
                                 stack = True, 
                                 relative = False, 
                                 xlabel = '1')
visualizer3.fit(x,y)
visualizer3.show();
_images/Body_126_0.png

Clearly the informative-proportion and proportion of tokens that are in the list associated with effectiveness are distinctly more predictive.

KFP_df.drop(columns = ['proportion_I_tokens', 'Average_Sen_Length', 'exclamation_proportion'], axis = 1, inplace = True)
Non_numeric_Feats = ['discourse_id', 'discourse_type', 'discourse_effectiveness', 'discourse_tokens', 'polarity']
FloatFeats2 = ['informative_proportion', 'proportion_E_tokens']
RemainingIntFeats = [i for i in KFP_df.columns if i not in Non_numeric_Feats and i not in FloatFeats2]
x = KFP_df[[i for i in RemainingIntFeats]]
y = KFP_df.discourse_effectiveness.to_list()
from yellowbrick.model_selection import FeatureImportances
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(multi_class = 'auto', solver = 'liblinear')
visualizer3 = FeatureImportances(model, 
                                 stack = True, 
                                 relative = False, 
                                 xlabel = '1')
visualizer3.fit(x,y)
visualizer3.show();
_images/Body_129_0.png

Well, that’s interesting. I didn’t anticipate that the ‘conservatism’ score would be nearly as important as it seems here, nor that it would predict significantly across effectiveness scores. I also find it a little surprising that length – whether using the composite character-token-word metric, or using textstat’s average sentence length– has so little impact. We’ll go ahead and drop it, since the word-vectors will more-or-less reflect length anyway.

There doesn’t seem to be a need for the punctuation metrics much at all, but we can collapse them. The count_E and E_exclusive appear to be capturing the same things in the same directions, so we can just take the mean of those.

to_drop_4 = ['Length']
#Collapsing Effective Tokens
Effective_Tokens = ['count_E_tokens', 'E_Exclusive_tokens']
KFP_df['Effective_Tokens'] = KFP_df[[i for i in Effective_Tokens]].sum(axis = 1)
KFP_df['Effective_Tokens'] = KFP_df[['Effective_Tokens']].apply(lambda x: round((x/2), 2))
#Punctuation
Punctuation = ['punctuation_diversity_text', 'positive_punctuation' ]
KFP_df['Punctuation'] = KFP_df[[i for i in Punctuation]].sum(axis = 1)
KFP_df['Punctuation'] = KFP_df['Punctuation'].apply(lambda x: round(x/2), 2)
#Dropping unnecessary variables
to_drop_4 += Effective_Tokens + Punctuation
KFP_df.drop(columns = [t for t in to_drop_4], axis = 1, inplace = True)

We haven’t looked yet at the polarity scores, which are a separate data structure (dictionary), but they are formatted as shown in the cell below. I think it will be easier to grasp what– if any – value they add if we split out the dictionary into it’s types.

KFP_df.polarity[1]
{'neg': 0.0, 'neu': 0.688, 'pos': 0.312, 'compound': 0.6124}
KFP_df['Neg'] = KFP_df['polarity'].apply(lambda score_dict: score_dict['neg'])
KFP_df['Neu'] = KFP_df['polarity'].apply(lambda score_dict: score_dict['neu'])
KFP_df['Pos'] = KFP_df['polarity'].apply(lambda score_dict: score_dict['pos'])
KFP_df['Compound'] = KFP_df['polarity'].apply(lambda score_dict: score_dict['compound'])
KFP_df.drop(columns = ['polarity'], axis = 1 ,inplace = True)
import plotly.express as px
df = KFP_df
fig = px.box(df, 
             x = 'Compound', 
             color = 'discourse_effectiveness', 
             notched= True,  
             title = 'Quantiles of Mean Polarity by Discourse Effectiveness',
            labels = {'Compound': "Compound Polarity Score"})
fig.update_traces(quartilemethod = 'exclusive')
fig.update_layout(title = {'xanchor': 'left'})
fig.show()

Apart from a very slight inclination towards greater positivity in effective essays, there doesn’t appear to be much in the way of distinctiveness nor consistency in this feature.

KFP_df.drop(columns = ['Neg', 'Neu', 'Pos', 'Compound'], axis = 1, inplace = True)
FinalFeatures = KFP_df.select_dtypes(include= ['int64', 'float64'])
from yellowbrick.features import Rank2D

visualizer1 = Rank2D(algorithm = 'pearson', 
                     size = (596,296),
                    title = "Correlation Between Continuous IVs")

visualizer1.fit_transform(KFP_df[[i for i in FinalFeatures]])

visualizer1.show();
_images/Body_139_0.png

Apart from an obvious degree of correlation between Effective_Tokens and Proportion_E_tokens, everything seems pretty independent. While all of the engineered features are integers or floats, the majority aren’t well-conceived as continuous predictors, with the exception of ‘informative_proportion’:

KFP_df['informative_proportion'].hist(figsize = (8,6), xrot = 45, bins = 45)
plt.show()
_images/Body_141_0.png
#Stratified by Effectiveness Score
fig=px.histogram(data_frame=KFP_df,
                 x=KFP_df.informative_proportion,
                 marginal="violin",
                 color=KFP_df.discourse_effectiveness)

fig.update_layout(title="Distribution of Informative Proportion By Discourse_Effectiveness:",
                  titlefont={'size': 25},template='plotly_white'     
                  )
fig.show()

The remaining variables, it seems to me, are better binned – many of them are zero, and given the variance in text-length, it seems like the mere presence of them may be the primary signal.

KFP_df.proportion_E_tokens = KFP_df.proportion_E_tokens.apply(lambda x: 1 if x > 0 else 0)
discBadSigns = KFP_df['proportion_E_tokens']
discEff = KFP_df['discourse_effectiveness']
newb = pd.concat([discBadSigns, discEff], axis = 1)
newbCT6 = pd.crosstab(discBadSigns, discEff, normalize = 'index')
newbCT6
plot5 = newbCT6.plot.barh(stacked = True, color = ['gold', 'forestgreen','firebrick'])
plot5.set(xlabel= 'Effectiveness Proportion', ylabel = 'Use of Effective Tokens')
plot5.legend(bbox_to_anchor = (1.05, .6));
_images/Body_145_0.png
def count_cat(x): 
    cat = 0
    if x in (1,2):
        cat+=1
    if x in (3,4): 
        cat+=2
    if x in (5,6):
        cat+=3
    if x in (7,8): 
        cat += 4
    if x > 9:
        cat+= 5
    return cat

KFP_df.Effective_Tokens = KFP_df.Effective_Tokens.apply(lambda x: count_cat(x))
KFP_df.effective_spelling = KFP_df.effective_spelling.apply(lambda x: count_cat(x))
KFP_df.bad_signs_score = KFP_df.bad_signs_score.apply(lambda x: count_cat(x))
KFP_df.Punctuation = KFP_df.Punctuation.apply(lambda x: count_cat(x))
KFP_df.conservatism_score = KFP_df.conservatism_score.apply(lambda x: count_cat(x))
KFP_df.bigrams_score = KFP_df.bigrams_score.apply(lambda x: count_cat(x))
binnable = ['effective_spelling', 
            'conservatism_score', 
            'bigrams_score', 
            'bad_signs_score', 
            'Effective_Tokens', 
            'Punctuation']
KFP_df[[i for i in binnable]].describe()
effective_spelling conservatism_score bigrams_score bad_signs_score Effective_Tokens Punctuation
count 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000
mean 0.092262 0.002530 0.020046 0.709887 0.066367 0.644689
std 0.303528 0.050771 0.141320 0.938856 0.272306 0.653189
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000
max 4.000000 2.000000 2.000000 5.000000 5.000000 5.000000
KFP_df.describe()
discourse_type effective_spelling conservatism_score bigrams_score informative_proportion proportion_E_tokens bad_signs_score Effective_Tokens Punctuation
count 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000 36765.000000
mean 2.174895 0.092262 0.002530 0.020046 0.061239 0.186999 0.709887 0.066367 0.644689
std 1.862446 0.303528 0.050771 0.141320 0.017139 0.389916 0.938856 0.272306 0.653189
min 0.000000 0.000000 0.000000 0.000000 0.001235 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.050725 0.000000 0.000000 0.000000 0.000000
50% 3.000000 0.000000 0.000000 0.000000 0.060890 0.000000 0.000000 0.000000 1.000000
75% 3.000000 0.000000 0.000000 0.000000 0.071212 0.000000 1.000000 0.000000 1.000000
max 6.000000 4.000000 2.000000 2.000000 0.250000 1.000000 5.000000 5.000000 5.000000
fig, ax = plt.subplots(3,2,figsize = (20,30))
for variable, subplot in zip(binnable, ax.flatten()):
    sns.countplot(x = KFP_df['discourse_effectiveness'], ax = subplot, hue= KFP_df[variable])
    for label in subplot.get_xticklabels():
        label.set_rotation(90)
_images/Body_149_0.png

So, taking these 1-by-1:

  1. Effective spelling is linearly distributed – in that it increases from ineffective-to-adequate-to-effective– but it is so sparse, it’s unlikely to be very helpful. The big challenge for the algorithm will be in using the presence or absence to differentiate at the edges of Ineffective-Adequate, and Adequate-Effective. Telling ineffective from effective will likely be much easier, but if there’s not a distinctive separation between adequate and one of the extremes, we can’t get anything out the feature. I’m going to drop this due to not being sufficiently dense.

  2. Bigrams is virtually indistinguishable when present at all. Drop.

  3. Effective tokens has the same problem as [1]. Drop.

  4. Conservatism may be informative, but it’s probably too sparse to be useful.

  5. Bad signs looks like it’s worth exploring in more depth. We are really more interested in proportions than counts, so we can investigate that with a stacked bar chart.

  6. Punctuation, as well, is hard to say.

KFP_df.drop(columns = ['effective_spelling', 
                       'bigrams_score', 
                       'conservatism_score', 
                       'Effective_Tokens'], 
            axis = 1, 
            inplace = True)
discBadSigns = KFP_df['bad_signs_score']
discEff = KFP_df['discourse_effectiveness']
newb = pd.concat([discBadSigns, discEff], axis = 1)
newbCT3 = pd.crosstab(discBadSigns, discEff, normalize = 'index')
newbCT3
plot3 = newbCT3.plot.barh(stacked = True, color = ['gold', 'forestgreen','firebrick'])
plot3.set(xlabel= 'Effectiveness Proportion', ylabel = 'Bad Signs Score')
plot3.legend(bbox_to_anchor = (1.05, .6));
_images/Body_152_0.png

So, bad signs pretty much gets progressively associated with ineffectiveness. That’s a good thing. But where the proportions change, it’s coming from adequate– Effective essays are proportional irrespective of how many bad signs they have.

discBadSigns = KFP_df['Punctuation']
discEff = KFP_df['discourse_effectiveness']
newb = pd.concat([discBadSigns, discEff], axis = 1)
newbCT4 = pd.crosstab(discBadSigns, discEff, normalize = 'index')
newbCT4
plot4 = newbCT4.plot.barh(stacked = True, color = ['gold', 'forestgreen','firebrick'])
plot4.set(xlabel= 'Effectiveness Proportion', ylabel = 'Punctuation Score')
plot4.legend(bbox_to_anchor = (1.05, .6));
_images/Body_154_0.png

So, a punctuation score of 2, 3, or 4 is the sweet-spot – Effectives tend to concentrate there. Too much = ineffective; too little = adequate. I think this is also worth keeping.

That leaves us with 1 continuous predictor, naturally scaled to 0-1, and roughly normal in distribution: informative_proportion. One given categorical variable: discourse_type. One text variable: discourse_tokens. One binary variable: ‘proportion_E_tokens’. One positively-oriented ordinal variable: Punctuation. One negatively-oriented ordinal variable: bad_signs_score. An identifier: discourse_id. And one dependent variable: discourse_effectiveness.

Part 6: Model-Building and Evaluation#

I won’t make this textbook beta-test any longer by including the full code for every model, but I’ll give you enough here to get a sense of what I did. All of the model specs are available on my github if you’d like to explore them in greater detail. Better yet, go check out my Tip o’ the Hat links, since in many cases mine were adaptations of related configurations, and those were made by much more skilled and competent analysts.

A basic challenge at this point in a project of this kind is discerning not only what variables to keep and what to drop, but also figuring how to combine so many different data types into a single model. As mentioned above, we have very nearly every different data type you can have, not all of which are well-suited to the same kind of algorithm or model.

  1. Sklearn Simple Models

  • Logistic Regression (Text-Only)

  • Multinomial Naive Bayes (Text-Only)

  • Support Vector Machine (Text-Only)

  • Random Forest Classifier (Eng. Features- Only)

  • AdaBoost Classifier (Eng. Features-Only)

  • XGBoost Classifier (Eng. Features-Only)

  1. Sklearn Integrated Features Pipelines

  2. Sklearn Enesembles

  3. Bigram Analysis

  4. Features-as-Text Models

  5. Doc2Vec

So, in the early stages, I want to see what the unadorned text does in unadorned models, to get a sense of both a) which models work more or less effectively for the content we have, and b) to establish a baseline for comparing more complex models. If you look on Kaggle, you’ll see, for example, ~200 submissions that have a log loss > .80, which is what we achieve with just an out-of-the-box logistic regression below. The competition had two tracks– one for accuracy and one for efficiency, so establishing this baseline with an essentially instant model is helpful in considering the accuracy/efficiency tradeoff.

#Train-Test-Split
from sklearn.model_selection import train_test_split as tts
#Processing and combining
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
#Exploratory Models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
#Evaluative programs
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
#Join tokens into complete strings to feed to vectorizer
KFP_df['model_tokens'] = KFP_df.discourse_tokens.apply(lambda x: ' '.join([i for i in x]))
#Train-Test Split
x = KFP_df.model_tokens
y = KFP_df.discourse_effectiveness.to_list()
x_train, x_test, y_train, y_test = tts(x, y, test_size = 0.3, random_state = 42)

Model 1: Logistic Regression: The CountVectorizer assembles word vectors with a frequency-based dictionary which builds a count-matrix.
The Tfidf_Transformer performs a logarithmic transform to normalize the count-matrix. The ‘tf’ indicates ‘term-frequency’ and the ‘idf’ indicates the ‘inverse document frequency.’ Simply put, it prevents extremely common words from drowning out infrequent words in terms of feature weight. The Pipeline is a chaining method, allowing multiple sequential processes to be simply listed, rather than expressly performed item-by-item. Note that while fit_transform would typically be applied to the training data, whereas only ‘fit’ would be applied to the test data, this sequence is not necessary when using the pipeline.

text_clf1 = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression(max_iter = 5000)),])

text_clf1 = text_clf1.fit(x_train, y_train)

pred = text_clf1.predict(x_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
labels = ['Effective', 'Ineffectve', 'Adequate']
pred = text_clf1.predict_proba(x_test)
evaluate = log_loss(y_test , pred) 
print(f"Log-Loss = {round(evaluate, 3)}")
[[5516  571  225]
 [1528 1242   21]
 [1620   54  253]]
              precision    recall  f1-score   support

    Adequate       0.64      0.87      0.74      6312
   Effective       0.67      0.45      0.53      2791
 Ineffective       0.51      0.13      0.21      1927

    accuracy                           0.64     11030
   macro avg       0.60      0.48      0.49     11030
weighted avg       0.62      0.64      0.59     11030
Log-Loss = 0.808

The log-loss metric printed at the bottom is the metric used in the competition evaluation: it permits classification performance in imbalanced datasets– like this one. We have, for example, ~57% ‘Adequate’ scores, so the heuristic ‘Always predict Adequate’ would perform better than 50/50 (or 33/33/33), but would not add anything. With log-loss, lower is better, indicating the probabilities assigned to each target class for each observation are significantly favoring the correct class. .808 corresponds, as we can see, to a ~8-10% accuracy improvement over the naive heuristic; the winning submissions had a log-loss of ~.554.

The next baseline model is a Multinomial Naive Bayes, run through the same pipeline.

text_clf2 = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])

text_clf2 = text_clf2.fit(x_train, y_train)

pred = text_clf2.predict(x_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
labels = ['Effective', 'Ineffectve', 'Adequate']
pred = text_clf2.predict_proba(x_test)
evaluate = log_loss(y_test , pred) 
print(f"Log-Loss = {round(evaluate, 3)}")
[[5908  347   57]
 [1855  935    1]
 [1780   28  119]]
              precision    recall  f1-score   support

    Adequate       0.62      0.94      0.75      6312
   Effective       0.71      0.34      0.46      2791
 Ineffective       0.67      0.06      0.11      1927

    accuracy                           0.63     11030
   macro avg       0.67      0.44      0.44     11030
weighted avg       0.65      0.63      0.56     11030
Log-Loss = 0.852

Here, again, we can sense the imperfect alignment of log-loss and accuracy: while we only decreased in accuracy by one percent, our log-loss jumped ~5 percentage points. The MNB must have generated less confident probabilities, despite ultimately concluding on approximately the same number of categorical guesses.

Our last baseline attempt is a support vector machine, which does reasonably well (though not as well as the logistic regressor), but which doesn’t have a built-in method for outputting a probability estimate and computing the log-loss.

text_clf3 = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC()),])

text_clf3 = text_clf3.fit(x_train, y_train)

pred = text_clf3.predict(x_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
labels = ['Effective', 'Ineffectve', 'Adequate']
[[5102  790  420]
 [1355 1372   64]
 [1477  104  346]]
              precision    recall  f1-score   support

    Adequate       0.64      0.81      0.72      6312
   Effective       0.61      0.49      0.54      2791
 Ineffective       0.42      0.18      0.25      1927

    accuracy                           0.62     11030
   macro avg       0.56      0.49      0.50     11030
weighted avg       0.59      0.62      0.59     11030

Next, we can try iteratively adding in some of the features we engineered, to see if they add anything. Decision Tree models can serve as a good non-parametric approach for using unscaled and disparate data types. Note that we are converting the target – discourse effectiveness – from it’s given form (“Adequate, Effective, Ineffective”) to type(‘category’) and then taking cat.codes. This is essentially a categorical encoding, just the extra-easy way.

featsX = ['discourse_type', 
            'informative_proportion', 
            'proportion_E_tokens', 
            'bad_signs_score', 
            'Punctuation']

x = KFP_df[[i for i in featsX]]
y = KFP_df.discourse_effectiveness.astype('category').cat.codes
x_train, x_test, y_train, y_test = tts(x, y, test_size = 0.3, random_state = 42)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

rfc1 = rfc.fit(x_train, y_train)

pred = rfc1.predict(x_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
labels = ['Effective', 'Ineffectve', 'Adequate']
pred = rfc1.predict_proba(x_test)
evaluate = log_loss(y_test , pred) 
print(f"Log-Loss = {round(evaluate, 3)}")
[[4268 1112  932]
 [1576  904  311]
 [1214  305  408]]
              precision    recall  f1-score   support

           0       0.60      0.68      0.64      6312
           1       0.39      0.32      0.35      2791
           2       0.25      0.21      0.23      1927

    accuracy                           0.51     11030
   macro avg       0.41      0.40      0.41     11030
weighted avg       0.49      0.51      0.49     11030
Log-Loss = 3.324
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada1 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
ada1.fit(x_train, y_train)

print('Training Accuracy: {:.2f}'.format(ada1.score(x_train, y_train)))
print('TEST Accuracy:  {:.2f}'.format(ada1.score(x_test, y_test)))

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
pred = ada1.predict(x_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
pred = ada1.predict_proba(x_test)
evaluate = log_loss(y_test , pred) 
print(f"Log-Loss = {round(evaluate, 3)}")
Training Accuracy: 0.60
TEST Accuracy:  0.60
[[5642  613   57]
 [1922  844   25]
 [1700  149   78]]
              precision    recall  f1-score   support

           0       0.61      0.89      0.72      6312
           1       0.53      0.30      0.38      2791
           2       0.49      0.04      0.07      1927

    accuracy                           0.60     11030
   macro avg       0.54      0.41      0.39     11030
weighted avg       0.57      0.60      0.52     11030

Log-Loss = 1.096
from xgboost import XGBClassifier

from sklearn.metrics import f1_score, accuracy_score
xgb3 = XGBClassifier(random_state = 42, num_class = 3)
xgb3.fit(x_train, y_train)
pred = xgb3.predict(x_test)
cf = confusion_matrix(y_test, pred)
print(cf)
print(classification_report(y_test, pred))
pred = xgb3.predict_proba(x_test)
evaluate = log_loss(y_test , pred) 
print(f"Log-Loss = {round(evaluate, 3)}")
[[5579  581  152]
 [1876  869   46]
 [1645  134  148]]
              precision    recall  f1-score   support

           0       0.61      0.88      0.72      6312
           1       0.55      0.31      0.40      2791
           2       0.43      0.08      0.13      1927

    accuracy                           0.60     11030
   macro avg       0.53      0.42      0.42     11030
weighted avg       0.56      0.60      0.54     11030

Log-Loss = 0.901
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf/np.sum(cf), annot = True, fmt= '.2%', cmap = 'Greens')

ax.set_title('Confusion Matrix for FeatureUnion XGB\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('\nActual Values\n')

ax.xaxis.set_ticklabels(['Adequate', 'Effective', 'Ineffective'])
ax.yaxis.set_ticklabels(['Adequate', 'Effective', 'Ineffective'])
plt.show()
_images/Body_171_0.png

I find it interesting that the xgboost – using only the features we gave it– is more capable of identifying the effective items than the ineffective: it only gets 1/16 correct there.

We can see what it found useful using the following:

from xgboost import plot_importance
from matplotlib import pyplot

plot_importance(xgb3)
pyplot.show();
_images/Body_173_0.png
feat_gains = xgb3.get_booster().get_score(importance_type = 'gain')
pyplot.bar(feat_gains.keys(), feat_gains.values());
pyplot.xticks(rotation = 90);
_images/Body_174_0.png

Curiously, the ‘gains’ (the graph on the bottom) suggests that the model finds the proportion of effective tokens to be its most valuable predictor. Additionally, we perform almost as well as the text models (not that those are especially stellar, but still) using only the extracted features. There’s some debate as to which of the XGBoost importance-types (‘weight’, ‘gain’, ‘total_gain’, ‘cover’, and ‘total_cover’) are most valuable and informative. The coverage metric deals with how many observations were influenced by a particular split, while the gains deals with the relative strength of the feature as a predictor. These are exactly inverted with respect to the ‘proportion_E_tokens’: it has the least coverage (probably because there are so many zero-values), but it’s presence or absence makes a big difference. This reinforces the observation I made above about this competition hinging upon the ‘edge-cases’: items near the cutoff thresholds from Adequate-to-Effective/Ineffective. It appears that the proportion of strong tokens does indeed help to differentiate Effectives from Adequates. But the bad signs and the informative proportion (though widely applicable) don’t add much in terms of distinguishing Ineffectives from Adequates.

The ‘SHAP’ (Shapley Additive Explanations) model for determining feature importance was an entirely separate approach that was strongly endorsed on many forums and in many articles. There is a module for computing this at the link below, but I am totally unfamiliar with the (game-theoretic) approach, and have added learning it to my to-do list, but haven’t implemented it here. (Tip o’ the Hat to the folks at Link.)

Anyway, a different approach to model-comparison is given below, wherein bigram frequency is also considered, and then model-selection is assisted by a cross-validation process that iteratively tries different sklearn algorithms. I include it because it is cool and because it illustrates some valuable tools/approaches, but I didn’t end up using it. Since the process involves re-mapping some features we have already converted, as well as the vectorizer replicating processes we’ve already applied, I’m switching back to the original dataset for this example. (Tips o’ the Hat to Link1. and Link2.)

dfcopy = pd.read_csv('train.csv')
dfcopy['effectiveness_id'] = dfcopy['discourse_effectiveness'].factorize()[0]
from io import StringIO
category_id_df = dfcopy[['discourse_effectiveness', 'effectiveness_id']].drop_duplicates().sort_values('effectiveness_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['effectiveness_id', 'discourse_effectiveness']].values)
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

vec_tfidf = TfidfVectorizer(ngram_range = (1,2), 
                            sublinear_tf = True,
                            analyzer = 'word', 
                            norm = 'l2',
                            max_df = 15000,
                            min_df = 15,                  #Min_df/max_df == the min/max times an ngram has to occur
                            encoding = 'latin-1',
                            stop_words = stop_words)

discourse = vec_tfidf.fit_transform(dfcopy.discourse_text).toarray()
labels = dfcopy.effectiveness_id
discourse.shape
(36765, 9450)
from sklearn.feature_selection import chi2
import numpy as np

N = 5

for discourse_effectiveness, effectiveness_id in sorted(category_to_id.items()):
    discourse_chi2 = chi2(discourse, labels == effectiveness_id)
    indices = np.argsort(discourse_chi2[0])
    feature_names = np.array(vec_tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
print("'{}':".format(discourse_effectiveness))
print("  Most correlated unigrams:\n       . {}".format('\n       . '.join(unigrams[-N:])))
print("  Most correlated bigrams:\n       . {}".format('\n       . '.join(bigrams[-N:])))
'Ineffective':
  Most correlated unigrams:
       . lisa
       . mona
       . venus
       . consists
       . students
  Most correlated bigrams:
       . college consists
       . electors majority
       . consists 538
       . 538 electors
       . mona lisa
X_train, X_test, y_train, y_test = tts(dfcopy['discourse_text'], 
                                       dfcopy['discourse_effectiveness'], 
                                       random_state = 0)
models = [
    RandomForestClassifier(n_estimators = 200, max_depth = 3, random_state = 0),
    LinearSVC(),
    MultinomialNB(), 
    LogisticRegression(random_state = 0, max_iter = 1000)
]

CV = 5
cv_df = pd.DataFrame(index = range(CV * len(models)))
from sklearn.model_selection import cross_val_score

entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, discourse, labels, scoring = 'accuracy', cv = CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
        
cv_df = pd.DataFrame(entries, columns= ['model_name', 'fold_idx', 'accuracy'])
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Input In [117], in <cell line: 4>()
      4 for model in models:
      5     model_name = model.__class__.__name__
----> 6     accuracies = cross_val_score(model, discourse, labels, scoring = 'accuracy', cv = CV)
      7     for fold_idx, accuracy in enumerate(accuracies):
      8         entries.append((model_name, fold_idx, accuracy))

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\model_selection\_validation.py:509, in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    506 # To ensure multimetric format is not supported
    507 scorer = check_scoring(estimator, scoring=scoring)
--> 509 cv_results = cross_validate(
    510     estimator=estimator,
    511     X=X,
    512     y=y,
    513     groups=groups,
    514     scoring={"score": scorer},
    515     cv=cv,
    516     n_jobs=n_jobs,
    517     verbose=verbose,
    518     fit_params=fit_params,
    519     pre_dispatch=pre_dispatch,
    520     error_score=error_score,
    521 )
    522 return cv_results["test_score"]

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\model_selection\_validation.py:267, in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    264 # We clone the estimator to make sure that all the folds are
    265 # independent, and that it is pickle-able.
    266 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 267 results = parallel(
    268     delayed(_fit_and_score)(
    269         clone(estimator),
    270         X,
    271         y,
    272         scorers,
    273         train,
    274         test,
    275         verbose,
    276         None,
    277         fit_params,
    278         return_train_score=return_train_score,
    279         return_times=True,
    280         return_estimator=return_estimator,
    281         error_score=error_score,
    282     )
    283     for train, test in cv.split(X, y, groups)
    284 )
    286 _warn_about_fit_failures(results, error_score)
    288 # For callabe scoring, the return type is only know after calling. If the
    289 # return type is a dictionary, the error scores can now be inserted with
    290 # the correct key.

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:1046, in Parallel.__call__(self, iterable)
   1043 if self.dispatch_one_batch(iterator):
   1044     self._iterating = self._original_iterator is not None
-> 1046 while self.dispatch_one_batch(iterator):
   1047     pass
   1049 if pre_dispatch == "all" or n_jobs == 1:
   1050     # The iterable was consumed all at once by the above for loop.
   1051     # No need to wait for async callbacks to trigger to
   1052     # consumption.

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:861, in Parallel.dispatch_one_batch(self, iterator)
    859     return False
    860 else:
--> 861     self._dispatch(tasks)
    862     return True

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:779, in Parallel._dispatch(self, batch)
    777 with self._lock:
    778     job_idx = len(self._jobs)
--> 779     job = self._backend.apply_async(batch, callback=cb)
    780     # A job can complete so quickly than its callback is
    781     # called before we get here, causing self._jobs to
    782     # grow. To ensure correct results ordering, .insert is
    783     # used (rather than .append) in the following line
    784     self._jobs.insert(job_idx, job)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
    206 def apply_async(self, func, callback=None):
    207     """Schedule a func to be run"""
--> 208     result = ImmediateResult(func)
    209     if callback:
    210         callback(result)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\_parallel_backends.py:572, in ImmediateResult.__init__(self, batch)
    569 def __init__(self, batch):
    570     # Don't delay the application, to avoid keeping the input
    571     # arguments in memory
--> 572     self.results = batch()

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:262, in BatchedCalls.__call__(self)
    258 def __call__(self):
    259     # Set the default nested backend to self._backend but do not set the
    260     # change the default number of processes to -1
    261     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262         return [func(*args, **kwargs)
    263                 for func, args, kwargs in self.items]

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:262, in <listcomp>(.0)
    258 def __call__(self):
    259     # Set the default nested backend to self._backend but do not set the
    260     # change the default number of processes to -1
    261     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262         return [func(*args, **kwargs)
    263                 for func, args, kwargs in self.items]

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\fixes.py:216, in _FuncWrapper.__call__(self, *args, **kwargs)
    214 def __call__(self, *args, **kwargs):
    215     with config_context(**self.config):
--> 216         return self.function(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\model_selection\_validation.py:680, in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
    678         estimator.fit(X_train, **fit_params)
    679     else:
--> 680         estimator.fit(X_train, y_train, **fit_params)
    682 except Exception:
    683     # Note fit time as time until error
    684     fit_time = time.time() - start_time

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py:450, in BaseForest.fit(self, X, y, sample_weight)
    439 trees = [
    440     self._make_estimator(append=False, random_state=random_state)
    441     for i in range(n_more_estimators)
    442 ]
    444 # Parallel loop: we prefer the threading backend as the Cython code
    445 # for fitting the trees is internally releasing the Python GIL
    446 # making threading more efficient than multiprocessing in
    447 # that case. However, for joblib 0.12+ we respect any
    448 # parallel_backend contexts set at a higher level,
    449 # since correctness does not rely on using threads.
--> 450 trees = Parallel(
    451     n_jobs=self.n_jobs,
    452     verbose=self.verbose,
    453     **_joblib_parallel_args(prefer="threads"),
    454 )(
    455     delayed(_parallel_build_trees)(
    456         t,
    457         self,
    458         X,
    459         y,
    460         sample_weight,
    461         i,
    462         len(trees),
    463         verbose=self.verbose,
    464         class_weight=self.class_weight,
    465         n_samples_bootstrap=n_samples_bootstrap,
    466     )
    467     for i, t in enumerate(trees)
    468 )
    470 # Collect newly grown trees
    471 self.estimators_.extend(trees)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:1046, in Parallel.__call__(self, iterable)
   1043 if self.dispatch_one_batch(iterator):
   1044     self._iterating = self._original_iterator is not None
-> 1046 while self.dispatch_one_batch(iterator):
   1047     pass
   1049 if pre_dispatch == "all" or n_jobs == 1:
   1050     # The iterable was consumed all at once by the above for loop.
   1051     # No need to wait for async callbacks to trigger to
   1052     # consumption.

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:861, in Parallel.dispatch_one_batch(self, iterator)
    859     return False
    860 else:
--> 861     self._dispatch(tasks)
    862     return True

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:779, in Parallel._dispatch(self, batch)
    777 with self._lock:
    778     job_idx = len(self._jobs)
--> 779     job = self._backend.apply_async(batch, callback=cb)
    780     # A job can complete so quickly than its callback is
    781     # called before we get here, causing self._jobs to
    782     # grow. To ensure correct results ordering, .insert is
    783     # used (rather than .append) in the following line
    784     self._jobs.insert(job_idx, job)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
    206 def apply_async(self, func, callback=None):
    207     """Schedule a func to be run"""
--> 208     result = ImmediateResult(func)
    209     if callback:
    210         callback(result)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\_parallel_backends.py:572, in ImmediateResult.__init__(self, batch)
    569 def __init__(self, batch):
    570     # Don't delay the application, to avoid keeping the input
    571     # arguments in memory
--> 572     self.results = batch()

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:262, in BatchedCalls.__call__(self)
    258 def __call__(self):
    259     # Set the default nested backend to self._backend but do not set the
    260     # change the default number of processes to -1
    261     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262         return [func(*args, **kwargs)
    263                 for func, args, kwargs in self.items]

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\joblib\parallel.py:262, in <listcomp>(.0)
    258 def __call__(self):
    259     # Set the default nested backend to self._backend but do not set the
    260     # change the default number of processes to -1
    261     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262         return [func(*args, **kwargs)
    263                 for func, args, kwargs in self.items]

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\fixes.py:216, in _FuncWrapper.__call__(self, *args, **kwargs)
    214 def __call__(self, *args, **kwargs):
    215     with config_context(**self.config):
--> 216         return self.function(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py:185, in _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees, verbose, class_weight, n_samples_bootstrap)
    182     elif class_weight == "balanced_subsample":
    183         curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)
--> 185     tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
    186 else:
    187     tree.fit(X, y, sample_weight=sample_weight, check_input=False)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py:937, in DecisionTreeClassifier.fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    899 def fit(
    900     self, X, y, sample_weight=None, check_input=True, X_idx_sorted="deprecated"
    901 ):
    902     """Build a decision tree classifier from the training set (X, y).
    903 
    904     Parameters
   (...)
    934         Fitted estimator.
    935     """
--> 937     super().fit(
    938         X,
    939         y,
    940         sample_weight=sample_weight,
    941         check_input=check_input,
    942         X_idx_sorted=X_idx_sorted,
    943     )
    944     return self

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py:420, in BaseDecisionTree.fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    409 else:
    410     builder = BestFirstTreeBuilder(
    411         splitter,
    412         min_samples_split,
   (...)
    417         self.min_impurity_decrease,
    418     )
--> 420 builder.build(self.tree_, X, y, sample_weight)
    422 if self.n_outputs_ == 1 and is_classifier(self):
    423     self.n_classes_ = self.n_classes_[0]

KeyboardInterrupt: 
import seaborn as sns

sns.boxplot(x = 'model_name', 
            y = 'accuracy', 
            data = cv_df)

sns.stripplot(x = 'model_name',
              y = 'accuracy',
              data = cv_df,
              size = 15, 
              jitter = True,
              edgecolor = 'black',
              linewidth = 2)

plt.show();

cv_df.groupby('model_name').accuracy.mean()
_images/Body_185_0.png
model_name
LinearSVC                 0.593526
LogisticRegression        0.623909
MultinomialNB             0.602802
RandomForestClassifier    0.570597
Name: accuracy, dtype: float64
from sklearn.model_selection import train_test_split as tts
model = LogisticRegression(random_state = 42, max_iter = 1000)

X_train, X_test, y_train, y_test, indices_train, indices_test = tts(discourse,
                                                                   labels,
                                                                   dfcopy.index,
                                                                   test_size = 0.33,
                                                                    random_state = 0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize = (8,6))
sns.heatmap(conf_mat,
           annot = True,
           fmt= 'd',
           xticklabels = category_id_df.discourse_effectiveness.values,
           yticklabels = category_id_df.discourse_effectiveness.values)

plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show();
_images/Body_187_0.png
from IPython.display import display

for predicted in category_id_df.effectiveness_id:
    for actual in category_id_df.effectiveness_id:
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
            display(dfcopy.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['discourse_effectiveness', 'discourse_text']])
            print('')

model.fit(discourse, labels)
'Ineffective' predicted as 'Adequate' : 1706 examples.
discourse_effectiveness discourse_text
23122 Ineffective they need to keep the most popular one by elec...
29753 Ineffective To change the way something has been done for ...
8405 Ineffective Maybe its not just the students and the parent...
735 Ineffective The fact that there is a possiblity of there o...
4431 Ineffective Venus might be dangers to go to and explore bu...
... ... ...
24774 Ineffective Many People dies in car accident. People have ...
4359 Ineffective Why should we limit our uses on car usages
20707 Ineffective Our government bases off what the people want ...
129 Ineffective Scientest are working hard to accheve this goa...
22524 Ineffective The video conferencing you need to pay for a c...

1706 rows × 2 columns

'Effective' predicted as 'Adequate' : 1573 examples.
discourse_effectiveness discourse_text
28105 Effective First of all, in some cases the popular vote m...
36384 Effective If they are a bad influence on people, you sho...
11962 Effective This mechanism is able to determine the emotio...
7273 Effective Extracurricular activities should be mandatory...
25725 Effective To me the country has it all wrong, I think we...
... ... ...
31867 Effective It is still possible to have a tie though beca...
21808 Effective As you probably already know, each state's ".....
31367 Effective Secondly, I strongly believe state senator sho...
654 Effective For students, summer projects can feel like a ...
25639 Effective The driverless cars, aren't even close to driv...

1573 rows × 2 columns

'Adequate' predicted as 'Ineffective' : 302 examples.
discourse_effectiveness discourse_text
6143 Adequate Besides, its not everydy you can judge emotion...
1704 Adequate if our sister planet is so inhospitable, why a...
22363 Adequate Others think its a icon for something a alien ...
13420 Adequate Yes you can see how that person is "feeling by...
26528 Adequate Do you think that a kid or teen texting in the...
... ... ...
8075 Adequate than alwys studying our panet and Venus someti...
22469 Adequate Each candidate had his or her group of elector...
6122 Adequate It's as simple as a fake or real smile, the ri...
737 Adequate The first fact, the atmosphere of venis is 97 ...
30081 Adequate I mean we do because we get to vote and all bu...

302 rows × 2 columns

'Effective' predicted as 'Ineffective' : 30 examples.
discourse_effectiveness discourse_text
14844 Effective Since most of the big states already have thei...
27696 Effective Not only does the Electoral College have more ...
33739 Effective if someone gives you advice that you do not ag...
34143 Effective even if people aren't there your peers post bl...
18557 Effective The founding fathers established it in the Con...
28779 Effective It's the Tuesday after the first Monday in Nov...
31056 Effective Each party selects a slate of electors that ar...
14810 Effective Mr. President, I have explained to you what th...
18032 Effective Whats up with the electoral college? Can we re...
11599 Effective They come in red, blue, green, black, and whit...
22910 Effective Its like now of day people just litter and lik...
26999 Effective electors choose the president and the vice pre...
25662 Effective if one occured, the election would be disrupte...
21308 Effective Also, acorrding to the Office of the Federal R...
8141 Effective In the booming world today, transportaion is e...
22296 Effective Lets say that their is someone in the school t...
850 Effective Dr. Huang states Your home Pc cant handle the ...
19205 Effective Paragraph 15 says "The Electoral College is wi...
20217 Effective Additionally, Plumer explains, "If you lived i...
16802 Effective last,popular votes/majority rule are in all a ...
32316 Effective I mean kids do have the strength to go around ...
21790 Effective First of all we don't know who picks the elect...
28378 Effective we can even go and buy cloth and give it to po...
25728 Effective Like the example in "whats wrong with the elec...
26801 Effective The Electoral College has been here since the ...
26876 Effective We as humans can not let fear stand in our way...
16592 Effective A president election is great but, there are s...
22946 Effective it will give them a harder drive to make good ...
21802 Effective If you were to question U.S. citizens about wh...
3640 Effective People are facinated with the Man on the Moon ...
'Adequate' predicted as 'Effective' : 651 examples.
discourse_effectiveness discourse_text
13611 Adequate to allow creativity,
34486 Adequate What you should probably do Is either ask diff...
24956 Adequate some people are blind,handicap etc Those peopl...
29484 Adequate Another problem that could arise is the condit...
20638 Adequate Some may say that being at home makes a studen...
... ... ...
34653 Adequate For example, when you need that push to do som...
11612 Adequate Students will get smarter and teachers would b...
24654 Adequate Therefore being comfortable can help your work...
30576 Adequate that they might of had in school, with friends...
12427 Adequate If the student picks they will be more willing...

651 rows × 2 columns

'Ineffective' predicted as 'Effective' : 83 examples.
discourse_effectiveness discourse_text
24436 Ineffective The author absolutely makes no point because y...
6588 Ineffective the heat/friction sensor pannels will produce ...
27613 Ineffective The people have the majority vote, so they sho...
10436 Ineffective They can use their creative side of their brai...
9401 Ineffective That's not even half of it either in other wor...
... ... ...
23705 Ineffective congress should consider the thoughts and opin...
19452 Ineffective America was made to give people freedom, so pe...
28562 Ineffective The Electoral College System allows certainity...
21386 Ineffective Driverless cars may be modern and use half of ...
16705 Ineffective But to me I think we shouldn't be doing commun...

83 rows × 2 columns


LogisticRegression(max_iter=1000, random_state=42)
from sklearn.feature_selection import chi2

N = 10
for discourse_effectiveness, effectiveness_id in sorted(category_to_id.items()):
    indices = np.argsort(model.coef_[effectiveness_id])
    feature_names = np.array(vec_tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
    
print("# '{}':".format(discourse_effectiveness))
print("  . Top unigrams:\n       . {}".format('\n       . '.join(unigrams)))
print("  . Top bigrams:\n       . {}".format('\n       . '.join(bigrams)))

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names = dfcopy['discourse_effectiveness'].unique()))
# 'Ineffective':
  . Top unigrams:
       . thats
       . dont
       . aliens
       . story
       . stuff
       . think
       . like
       . whats
       . ther
       . gonna
  . Top bigrams:
       . process place
       . positive negative
       . advantages limiting
       . many ways
       . makes people
       . would try
       . count votes
       . good bad
       . online class
       . going make
              precision    recall  f1-score   support

    Adequate       0.64      0.86      0.74      6910
 Ineffective       0.53      0.18      0.26      2170
   Effective       0.66      0.47      0.55      3053

    accuracy                           0.64     12133
   macro avg       0.61      0.50      0.52     12133
weighted avg       0.63      0.64      0.61     12133

So, this process gave us the strongest prediction so far, but not by much. A downside to all of these iterative model-tests is that it takes quite a long time to generate all of the scores – and adding the bigrams in for analysis makes it even worse. I’ll point out, once again, that this approach manages to identify >80% of Adequates, nearly 50% of Effectives, but only 1/6 Ineffectives.

Sklearn FeatureUnion Pipelines#

Having found baseline performances, the next logical step seems to be to try integrating those features and feeding them into a single predictor (or predictive ensemble). The first approach to doing so is using a pipeline that looks at both text and the engineered features, using appropriate transformations for each, then unionizing them to make all of the columns amenable to simultaneous analysis.

#Modules
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import KFold, GridSearchCV, StratifiedKFold
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import make_scorer
#Classes to pull text, numeric features
class TextTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y = None, *parg, **kwarg):
        return self
    
    def transform(self, X):
        return X[self.key]
    
class NonTextTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        return X[[self.key]]
#Central Pipeline
fp = Pipeline([
    ('features', FeatureUnion([
        ('model_tokens', Pipeline([
            ('transformer', TextTransformer(key = 'model_tokens')),
            ('vectorizer', TfidfVectorizer(ngram_range = (1,1), 
                                            sublinear_tf = True,
                                            analyzer = 'word', 
                                            norm = 'l2',
                                            max_df = 15000,
                                            min_df = 2,                              
                                            encoding = 'latin-1'))])
        ),
        ('informative_proportion', Pipeline([
            ('transformer', NonTextTransformer(key= 'informative_proportion'))])),
        ('proportion_E_tokens', Pipeline([
            ('transformer', NonTextTransformer(key= 'proportion_E_tokens'))])),
        ('bad_signs_score', Pipeline([
            ('transformer', NonTextTransformer(key= 'bad_signs_score'))])),
        ('Punctuation', Pipeline([
            ('transformer', NonTextTransformer(key= 'Punctuation'))])),
        ('discourse_type', Pipeline([
            ('transformer', NonTextTransformer(key= 'discourse_type'))
        ]))    
                                ])
    ),
#Here    
    ('xgb3', XGBClassifier(objective = 'multi:softmax', random_state = 42, num_class = 3))
         ])
#GridSearch, k-fold
param_grid = {'clf__n_estimators': np.linspace(1, 100, 10, dtype=int),
              'clf__min_samples_split': [3, 10],
              'clf__min_samples_leaf': [3],
              'clf__max_features': [7],
              'clf__max_depth': [None],
              'clf__criterion': ['gini'],
              'clf__bootstrap': [False]}

kfold = StratifiedKFold(n_splits=7)
scoring = {'Accuracy': 'accuracy', 'F1': 'f1_macro'}
refit = 'F1'

#XGBoostGridSearch
search_space = [
  {
    'xgb3__n_estimators': [50, 100, 150, 200],
    'xgb3__learning_rate': [0.01, 0.1, 0.2, 0.3],
    'xgb3__max_depth': range(3, 10),
    'xgb3__colsample_bytree': [i/10.0 for i in range(1, 3)],
    'xgb3__gamma': [i/10.0 for i in range(3)],
  }
]

kfold = KFold(n_splits=2, random_state=42, shuffle = True)

scoring = {'AUC':'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
#train_test_split
x = KFP_df[['model_tokens', 
            'informative_proportion', 
            'discourse_type',
           'bad_signs_score',
           'Punctuation',
           'proportion_E_tokens']]

y = KFP_df['discourse_effectiveness'].astype('category').cat.codes
x_train, x_test, y_train, y_test = tts(x, y, random_state = 42, test_size = 0.3)

Due to the runtime – and corresponding difficulties in terms of configuring the Jupyter-Book, I’m commenting out the execution of the Grid Search algorithm below.

Using the framework above, all that’s required to test different models is to specify them from ‘#here’ down. That is, replace the model with the desired one, and then fill the appropriate grid-search elements in. I’ve actually experimented with a variety of these, but just include one as an illustrative example. Likewise, I experimented with GridSearch, RandomSearch, and melange of other cross-fold validation tools, but I think this is sufficient to grasp the essence of the process. No need to rehash every single instance of failed experimentation.

The second approach is making non-text features into text that is appended to the word-vectors we pass to the model.

def integrated_text_features(x):
        
    Integrated_Text = []
    Integrated_Labels = []

    for index, row in x.iterrows():

        combined1 = ''
        combined1+= 'Bad Signs Score: {:}, Informative Proportion: {:}, Part of Text: {:}. '.format(row['bad_signs_score'],
                                                                                      round(row['informative_proportion'],3),
                                                                                      row['discourse_type'])
        combined1+= ''.join([i for i in row['model_tokens']])
        combined1 = combined1.lower()
        Integrated_Text.append(combined1)
        Integrated_Labels.append(row['discourse_effectiveness'])
        
    IText = pd.Series(Integrated_Text)
    ILab = pd.Series(Integrated_Labels)
    
    IPrepped = pd.concat([IText, ILab], axis = 1, keys = ['Text', 'Label'])
        
    return IPrepped


IPrepped_df = integrated_text_features(KFP_df)
IPrepped_df['Text'][1]
'bad signs score: 4, informative proportion: 0.043, part of text: 5. perspective think face natural think paragraphs talking think natural'

Along with the output above, I also experimented with using text that hadn’t been processed, using descriptors instead of numbers, and so on:

KFP_df2 = pd.read_csv("train.csv")
KFP_df2 = KFP_df2.merge(KFP_df, how = 'inner', left_index = True, right_index = True)
KFP_df2 = KFP_df2[['discourse_text', 
                   'discourse_effectiveness_x', 
                   'discourse_type_x', 
                   'informative_proportion', 
                   'bad_signs_score' ]]
KFP_df2.rename({'discourse_effectiveness_x': 'discourse_effectiveness',
               'discourse_type_x' : 'discourse_type',
               'discourse_text' : 'model_tokens'}, axis =1, inplace = True)

effectiveness_map = {0: 'Excellent',
                     1: 'Very Good', 
                     2: 'Good', 
                     3: 'Acceptable',
                     4: 'Poor',
                     5: 'Very Poor'}
KFP_df2['bad_signs_score'] = KFP_df2['bad_signs_score'].map(effectiveness_map)

IPrepped2 = integrated_text_features(KFP_df2)
IPrepped2['Text'][0]
"bad signs score: very good, informative proportion: 0.047, part of text: lead. hi, i'm isaac, i'm going to be writing about how this face on mars is a natural landform or if there is life on mars that made it. the story is about how nasa took a picture of mars and a face was seen on the planet. nasa doesn't know if the landform was created by life on mars, or if it is just a natural landform. "

This approach was inspired by a wonderful set of blog posts, so Tip o’ the Hat to Chris McCormick, Ken Gu, and the team at Multi-Modal Toolkit Link.

This appealed to me in that it was concise, and it gave me the impression of ‘giving the computer my opinion.’ Unfortunately – at least with respect to the models I was able to configure fully with Kaggle’s notebooks (more later)– when I compared my highest-performing neural network classifier’s performance on the original text and the annotated text, just leaving all of this out yielded (slightly) better accuracy.

Finally, I tried a model using the ‘doc2vec’ approach – in essence it extends the word-vectorization process to include ‘tags’ which add a layer of similarity grouping. When we split training and testing data, normally, the model we select treats the outcome as a tag in this sense. The model analyses all of the features, looks at the labels we’ve given it, and tries to define a function that sends inputs to the appropriate label (the dependent variable). In the doc2vec process, this is essentially done at the step of word-embedding. The model decides how to configure the vector space itself with a similar training process, linking documents of common tags in such a way that the vector space reflects that similarity, not just the similarity of words. The example below hopefully makes this process clear. Learn more at – and Tip o’ the Hat to– the following links:

Link1.

Link2.

Link3.

Link4.

#I'm reimporting the original file here so as not to add any unwanted modifications to the in-progress dataframe. 
df = pd.read_csv('train.csv')

#Some dependencies
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from gensim.models import Doc2Vec
from sklearn import utils
from gensim.models.doc2vec import TaggedDocument
import multiprocessing

# All of the preprocessing used above, just condensed. 
df.discourse_text = df.discourse_text.apply(contractions.fix)
df.discourse_text = df.discourse_text.apply(strip_punctuation)
df.discourse_text = df.discourse_text.apply(lambda x: x.lower())
df.discourse_text = df.discourse_text.apply(lambda x: x.strip())
df.discourse_text = df.discourse_text.apply(strip_multiple_whitespaces)
df['discourse_text'] = df['discourse_text'].apply(remove_stopwords)
#First, we split the dataset, keeping the *final* target (effectiveness) in the test set. 
from sklearn.model_selection import train_test_split as tts
train, test = tts(df, test_size = 0.3, random_state = 42)

import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
#Here, we indicate that all text sequences should be tagged with effectiveness Scores, to configure similarity based on outcome
train_tagged = train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['discourse_text']), tags= [r.discourse_effectiveness]), axis=1)
test_tagged = test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['discourse_text']), tags=[r.discourse_effectiveness]), axis=1)
#Now, we create the document vectors, and the algorithm learns to build the space in consideration of scores
cores = multiprocessing.cpu_count()
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])

for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words)) for doc in sents])
    return targets, regressors
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 4300757.57it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5125132.40it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 6398744.05it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 6400641.21it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5130003.97it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 6396468.94it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5123429.53it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5124645.75it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5154255.25it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 6453836.38it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5160662.34it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5160415.62it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5124402.46it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 4300586.22it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5161155.85it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5159922.24it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5159182.36it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5161896.30it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5162390.04it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5127810.61it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5161649.46it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5161155.85it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5125619.14it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5161155.85it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5130247.79it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 6398744.05it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5160909.08it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5168074.95it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 5128785.21it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 6451521.93it/s]
100%|███████████████████████████████████████████████████████████████████████| 25735/25735 [00:00<00:00, 4277747.93it/s]
# *Now* we split the training and testing instances for the classifier
y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

logreg = LogisticRegression(max_iter = 5000, n_jobs=1, C=1e5)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(f"Testing Accuracy: {accuracy_score(y_test, y_pred)}")
pred = logreg.predict_proba(X_test)
evaluate = log_loss(y_test , pred) 
print(f"Log-Loss = {round(evaluate, 3)}")
Testing Accuracy: 0.6043517679057117
Log-Loss = 0.923

This is an interesting idea, and there’s research suggesting it can be implemented to impressive effect. However, it’s not something I’ve explored in much detail, and I don’t feel like I understand it particularly well. That being the case, I’m not at all convinced the model couldn’t be coaxed into performing better if it was tuned and customized by someone with greater mastery of this technique.